  1. Aug 2024
    1. Abstract. Background: MOB typing is a classification scheme that classifies plasmid genomes based on their relaxase gene. The host ranges of plasmids of different MOB categories are diverse, and MOB typing is crucial for investigating the mobilization of plasmids, especially the transmission of resistance genes and virulence factors. However, MOB typing of plasmid metagenomic data is challenging due to the highly fragmented nature of metagenomic contigs. Results: We developed MOBFinder, an 11-class classifier to classify the plasmid fragments into 10 MOB categories and a non-mobilizable category. We first performed the MOB typing for classifying complete plasmid genomes using the relaxase information, and constructed the artificial benchmark plasmid metagenomic fragments from these complete plasmid genomes whose MOB types are well annotated. Based on natural language models, we used the word vector to characterize the plasmid fragments. Several random forest classification models were trained and integrated for predicting plasmid fragments with different lengths. Evaluating the tool over the benchmark dataset, MOBFinder demonstrates higher performance compared to the existing tool, with an overall accuracy of approximately 59% higher than the MOB-suite. Moreover, the balanced accuracy, harmonic mean and F1-score could reach 99% in some MOB types. In an application focused on a T2D cohort, MOBFinder offered insights suggesting that the MOBF type might accelerate the antibiotic resistance transmission in patients suffering from T2D. Conclusions: To the best of our knowledge, MOBFinder is the first tool for MOB typing for plasmid metagenomic fragments. MOBFinder is freely available at https://github.com/FengTaoSMU/MOBFinder.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae047), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

      Reviewer 1: Haruo Suzuki

      I recommend that the authors consider revising based on the following points.

      1. "the unpaired Wilcoxon signed-rank two-sided test" -> should be corrected to either "Wilcoxon rank-sum test" or "Mann-Whitney U test".

      https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test "Wilcoxon rank-sum test" redirects here. For Wilcoxon signed-rank test, see Wilcoxon signed-rank test. https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test Not to be confused with Wilcoxon rank-sum test.

      2. "Since MOBscan can only predict the MOB type with plasmid proteins, we annotated the plasmids in the test set with Prokka, then manually submitted them to the MOBscan website for MOB type annotation.

      Given that MOBScan operates as an online tool and cannot be executed locally, the calculation of MOBScan's run time was confined to the duration spent on preprocessing with Prokka locally." (Please refer to Line 313-319 in the revised manuscript).

      -> Actually, it can be executed locally using the scripts included in https://github.com/santirdnd/COPLA/. It may not be necessary to run MOBscan locally (it may be okay that they manually submitted them to the MOBscan website), but I'll inform you regardless.

      3. "In the comparison, it was observed that MOBscan did not perform well, achieving low accuracy and kappa values across sequences of varying lengths, while MOB-suite exhibited marginally better performance than MOBscan when handling sequences of greater length (Figure 3A, 3B)." (Please refer to Line 418-421 in the revised manuscript).

      -> Do the authors' results contradict the following general expectation? MOB-typer utilizes BLAST, whereas MOBscan utilizes hmmscan, and therefore, MOBscan is expected to retrieve more distantly related proteins than MOB-typer.

      4. "MOB-suit and MOBscan are represented by blue lines, orange lines and gray lines respectively." -> should be "MOB-suite"

      5. I suggest receiving English language editing before publishing the paper. "For the MOB typing, MOBscan [18] uses the HMMER model to annotated the relaxases and further perform MOB typing." -> should be "For the MOB typing, MOBscan [18] uses the HMMER model to annotate the relaxases and further perform MOB typing."
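
The distinction raised in the first point above (rank-sum for unpaired samples versus signed-rank for paired samples) can be illustrated with a minimal pure-Python sketch. The function names and the toy numbers are hypothetical and are not taken from the manuscript; the code only computes the test statistics, not p-values, and is not the authors' analysis.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """U statistic of x for two INDEPENDENT samples (Wilcoxon rank-sum)."""
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    return rank_sum_x - len(x) * (len(x) + 1) / 2


def wilcoxon_signed_rank(x, y):
    """(W+, W-) for PAIRED samples; zero differences are dropped."""
    if len(x) != len(y):
        raise ValueError("signed-rank test needs paired samples of equal length")
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical toy data: unpaired groups may differ in size ...
print(mann_whitney_u([1, 2, 3], [4, 5, 6, 7]))
# ... whereas the signed-rank statistic requires one-to-one pairing.
print(wilcoxon_signed_rank([5, 7, 9], [4, 9, 6]))
```

The unpaired statistic accepts groups of different sizes, while the signed-rank statistic is only defined for matched pairs, which is why "unpaired Wilcoxon signed-rank" is a contradiction in terms.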

      Reviewer 2: Dan Wang

      The manuscript provides a comprehensive background on the necessity and challenges of MOB typing in the context of plasmid genomics and its significance in tracking the transmission of resistance genes and virulence factors. The innovation introduced by MOBFinder, which incorporates an 11-class classification system, addresses a critical gap in current research methodologies by enhancing the precision of plasmid fragment classification.

      Key Strengths:

      Innovation: MOBFinder represents a novel approach in the typing of metagenomic plasmid fragments using word vector characterization combined with machine learning techniques.

      Methodological Rigor: The methodological approach, including the use of random forest models and the construction of a benchmark dataset from annotated complete plasmid genomes, is robust and well-executed.

      Performance: The tool demonstrates superior performance compared to existing tools like MOBscan and MOB-suite, providing a significant improvement in accuracy.

      Impact on Field: The application of MOBFinder in a T2D cohort illustrates the tool's practical utility and its potential to influence antibiotic resistance studies.

      Recommendation: Given the thorough revisions and the contributions this manuscript offers to the field of microbial genomics and antibiotic resistance, I recommend that the manuscript be accepted for publication in GigaScience.

    1. Abstract. A large range of sophisticated brain image analysis tools have been developed by the neuroscience community, greatly advancing the field of human brain mapping. Here we introduce the Computational Anatomy Toolbox (CAT) – a powerful suite of tools for brain morphometric analyses with an intuitive graphical user interface, but also usable as a shell script. CAT is suitable for beginners, casual users, experts, and developers alike, providing a comprehensive set of analysis options, workflows, and integrated pipelines. The available analysis streams – illustrated on an example dataset – allow for voxel-based, surface-based, as well as region-based morphometric analyses. Notably, CAT incorporates multiple quality control options and covers the entire analysis workflow, including the preprocessing of cross-sectional and longitudinal data, statistical analysis, and the visualization of results. The overarching aim of this article is to provide a complete description and evaluation of CAT, while offering a citable standard for the neuroscience community.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae049), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

      Reviewer 3: Cyril Pernet

      CAT has been around for a long time and is a well-maintained toolbox. The paper describes all the features and additionally provides tests/validations of those features. I have left a few comments on the PDF (uploaded), which I don't see as mandatory, and have thus 'accepted' the paper (leaving the authors to decide what to do with those comments). It provides a nice reference for the toolbox.

      Reviewer 2: Chris Foulon

      Overall, I think the CAT software provides valuable tools to analyse morphometric differences in the brain and promotes open science. The study shows the software's capabilities rather well. However, I think some clarifications would help the readers understand and evaluate the quality of the methods.

      Comments: Figure 2: Looking at the chart, I have a question regarding the pipeline. Is it required to run the whole pipeline using CAT? Or is it possible to input already registered data to start directly with the VBM analysis or further?

      Voxel-based Processing: The above question is quite important, seeing that the preprocessing uses rather old registration methods. The users might want to use more recent registration methods, especially with clinical populations.

      Spatial Registration and Figure 3: For the registration, how is the registration performing with clinical populations (e.g. stroke patients)? It can be significant for the applicability of the methods with specific disorders.

      Surface Registration and Figure 3: What type of noise is used to evaluate the accuracy? This can be important as not every noise can be modelled easily, and some noises are more or less pronounced depending on the modality.

      Maybe having the letters of the figure panels referred to in the text would help the reader.

      Performance of CAT: Although I see the advantage of using simulated data, I think it would require more explanation. First, what tells the reader the quality of this simulated data, and how does it compare to real data? Second, is it only healthy data? In that case, the accuracy evaluation might not be relevant for the majority of the clinical studies using CAT.

      Longitudinal Processing: Are VBM analyses sensitive enough to capture changes over days? I would be surprised, but I would be interested to see studies doing it (and the readers would also benefit from it, I reckon).

      Mapping onto the Cortical Surface: I am a bit confused about the interest in mapping functional or diffusion parameters to the surface. Do you have examples of articles doing that? It sounds like it would waste a lot of information from these parameters, but I am not familiar with this type of analysis. "Optionally, CAT also allows mapping of voxel values at multiple positions along the surface normal at each node". I do not understand this sentence; I think it should be clarified.

      Example application: Is there a way to come back from the surface space to the volume space to compare the results? For example, VBM and SBM should provide fairly similar results, but comparing them is difficult when they are not in the same space. Additionally, in the end, the surface representation is just that, a representation; most other analyses are still done on the volume space, so it could be helpful to translate the result on the surface back to the volume (if it is not already available).

      Evaluation of CAT12: I was confused with Supplemental Figure 1 as it is not mentioned in the caption that it is the AD data and not the simulated one. Maybe it would help the reader to mention it.

      Regarding the reliability of CAT12, it seems to capture more things, but I struggle to see how we can be sure that this is "better" than other methods; couldn't it be false positives?

      "those achieved based on manual tracing and demonstrated that both approaches produced comparable hippocampal volume." comparable volumes do not really mean the same accuracy; this sentence could be misleading.

      I think the multiple studies show that CAT12 is as valid as any other tool but I am not sure the argument that it is better is as solid. Of course, I understand that there is no ground truth for what a relevant morphological change is for a given disease.

      Methods: Statistical Analysis: Why is the FWER correction used for the voxel-wise statistics (which perform many comparisons) and FDR used on ROI-wise statistics (which perform much fewer comparisons)? I would expect the opposite.
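
To make the FWER versus FDR question above concrete, here is a minimal sketch contrasting a Bonferroni correction (which controls the family-wise error rate) with the Benjamini-Hochberg step-up procedure (which controls the false discovery rate). The function names and p-values are hypothetical, and the sketch does not claim to reproduce the corrections implemented in CAT or used in the manuscript.

```python
def bonferroni(pvalues, alpha=0.05):
    """FWER control: reject p_i only if p_i <= alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """FDR control: find the largest rank k with p_(k) <= k/m * alpha,
    then reject the k smallest p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten comparisons.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("Bonferroni rejections:", sum(bonferroni(pvals)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(pvals)))
```

On the same p-values, the FDR procedure rejects at least as many hypotheses as Bonferroni, which is why the choice of correction per analysis level (voxel-wise versus ROI-wise) affects how conservative the reported results are.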

      "The outcomes of the VBM and voxel-based ROI analyses were overlaid onto orthogonal sections of the mean brain created from the entire study sample (n=50); " I don't understand what this refers to.

      Reviewer 1: Chris Armit

      This Technical Note describes the Computational Anatomy Toolbox (CAT) software tool, which includes a Graphical User Interface that can be used for morphometric analysis of Structural MRI data. The CAT software tool is impressive, and enables voxel-based and surface-based morphometric analysis to be accomplished on Structural MRI data, and also voxel-based tissue segmentation and surface mesh generation to be applied to these 3D imaging datasets. The authors helpfully illustrate the utility of the Computational Anatomy Toolbox (CAT) using T1-weighted structural brain images from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database.

      This is an excellent, freely available tool for the Neuroimaging community and the authors are to be commended for developing this impressive software tool.

      Minor comments

      I first attempted to launch the CAT software tool on macOS 14.0 (Sonoma) with Apple M1 chip, and on the command line I received the following message: "spm12" is damaged and can't be opened. You should move it to the Bin.

      I additionally tested the CAT software tool on macOS 12.6 (Monterey) with Intel chip, and I was able to run the CAT software tool on this platform.

      A minor criticism is that the installation instructions in the supporting Readme file for archive [CAT12.9_R2023b_MCR_Mac_arm64.zip], which runs on macOS with Intel chip, only details how to install the SPM (Statistical Parametric Mapping) software tool. The CAT software tool needs to be downloaded separately and then moved into the directory of the SPM toolbox, and these installation instructions are included in the supporting CAT software documentation (https://neuro-jena.github.io/cat12-help/#get_started)

      With the issues I encountered in installation, I invite the authors to list the System Requirements - specifically the Operating Systems that are needed to run the CAT software tool - in the GigaScience manuscript and also in the supporting CAT software documentation.

      In addition, it would be particularly helpful if the instructions on how to install CAT in the context of SPM were included in the supporting Readme files for the Computational Anatomy Toolbox (CAT) zip archives.

    1. Abstract. Background: Visualization is an indispensable facet of genomic data analysis. Despite the abundance of specialized visualization tools, there remains a distinct need for tailored solutions. However, their implementation typically requires extensive programming expertise from bioinformaticians and software developers, especially when building interactive applications. Toolkits based on visualization grammars offer a more accessible, declarative way to author new visualizations. Nevertheless, current grammar-based solutions fall short in adequately supporting the interactive analysis of large data sets with extensive sample collections, a pivotal task often encountered in cancer research. Results: We present GenomeSpy, a grammar-based toolkit for authoring tailored, interactive visualizations for genomic data analysis. Users can implement new visualization designs with little effort by using combinatorial building blocks that are put together with a declarative language. These fully customizable visualizations can be embedded in web pages or end-user-oriented applications. The toolkit also includes a fully customizable but user-friendly application for analyzing sample collections, which may comprise genomic and clinical data. Findings can be bookmarked and shared as links that incorporate provenance information. A distinctive element of GenomeSpy’s architecture is its effective use of the graphics processing unit (GPU) in all rendering. GPU usage enables a high frame rate and smoothly animated interactions, such as navigation within a genome. We demonstrate the utility of GenomeSpy by characterizing the genomic landscape of 753 ovarian cancer samples from patients in the DECIDER clinical trial. Our results expand the understanding of the genomic architecture in ovarian cancer, particularly the diversity of chromosomal instability. We also show how GenomeSpy enabled the discovery of clinically actionable genomic aberrations. Conclusions: GenomeSpy is a visualization toolkit applicable to a wide range of tasks pertinent to genome analysis. It offers high flexibility and exceptional performance in interactive analysis. The toolkit is open source with an MIT license, implemented in JavaScript, and available at https://genomespy.app/.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae040), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

      Reviewer 3: Luca Beltrame

      Lavikka and coworkers present an interesting visualization framework and associated application for genomics visualization. The challenges outlined by the authors in finding appropriate visualization tools for large-scale genomics data were also experienced by this reviewer, and thus better and improved tools are always welcome.

      The manuscript is well laid out, presenting the key facts in a proper manner. The use of GPU rendering for graphs is an excellent move, and I expect it to be extremely useful even for machines with lower-end GPUs. The code looks reasonably written and commented (being an application, this too is important for a review). I have also tested the examples, and indeed the software is very useful (the documentation should, however, point out that some issues regarding saving the canvas still exist). One may argue that the use of JSON for the graph grammar can be awkward, but at the same time other file formats may be more problematic and/or require specialized parsers (which open yet another can of worms).

      Documentation is also logically organized. As a minor suggestion, the authors may want to add some form of search to their documentation page.

      There is an open question that the authors may want to answer: they explicitly mention GISTIC 1.0 for the G-score plots. Is there a specific reason why they chose 1.0? The 2.0 algorithm is far more robust and produces more reliable results.

      Reviewer 2: Alessandro Romanel

      In this article, the authors introduce GenomeSpy, a grammar-based toolkit for creating customized, interactive visualizations for genomic data analysis. I find the article extremely interesting, and I believe the framework introduced by the authors has broad utility. The website is well-maintained and documented, and I particularly found the examples mentioned in the paper to be useful and informative. The authors chose to present their toolkit by narrating the navigation of a dataset generated in the DECIDER study. While the narrative makes the utility of the visualizations clear in data interpretation, what is not clear at all is how easy it is to use GenomeSpy to create those same visualizations. I believe that the success of a toolkit like this is strongly tied to its ease of use, and this aspect is not clear or prominently highlighted in the manuscript. Additionally, it would be interesting to more clearly highlight GenomeSpy's strengths compared to other approaches. By combining Rshiny and ggplot, it is indeed possible to create complex interactive data visualizations. Therefore, it would be necessary to more strongly emphasize what the other innovative aspects of GenomeSpy are, beyond GPU acceleration, compared to other approaches available today.

      Reviewer 1: Andrea Sboner

      In this manuscript, the authors present GenomeSpy, a visualization toolkit geared toward the rapid and interactive exploration of genomic features. They demonstrate how this tool can help investigators explore a large cohort of 753 ovarian cancers sequenced by whole-genome sequencing (WGS). By using the tool, they were able to identify outliers in the dataset and refine their diagnosis. The tool is inspired by Vega-Lite, a high-level grammar for interactive graphics, and extends it for genomic applications.

      The manuscript is clearly written, and the authors provide links to the application itself, tutorials and examples. I want to commend them for doing this. This is a tool that would nicely complement others and has the specific advantage of using high-performance GPUs that are now common in modern computers.

      The only concern that I have is about a couple of claims that may not be fully supported by the data provided:

      1. Claim: users can implement new visualization designs easily. While the grammar certainly enables the users to define new designs, I do not think that this is necessarily easy, as the authors themselves recognize in the discussion section when they suggest providing templates to reduce the learning curve. Indeed, the example in Figure 2 is still quite verbose and would need some time for anyone to understand the syntax and the style. The playground web application facilitates testing it, though.

      2. Claim: the grammar-based approach allows mixing and matching. I did not find any specific example of how to do this. It would have been quite interesting to see the intersection between the DNA representation of structural variants and RNA-seq data (if this is what is meant by "mix and match").
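
As a rough illustration of what "mix and match" could mean in a declarative grammar, the sketch below builds a layered specification in the style of Vega-Lite, which the review names as GenomeSpy's inspiration. It follows Vega-Lite conventions (`layer`, `mark`, `encoding`) and is an assumption about the general shape of such specs, not GenomeSpy's actual grammar; the file names, field names and the `layer` helper are hypothetical.

```python
import json


def layer(*views):
    """Compose independent view specs into one layered spec (Vega-Lite style)."""
    return {"layer": list(views)}


# Hypothetical view 1: structural-variant calls drawn as points.
variants_view = {
    "data": {"url": "variants.tsv"},
    "mark": "point",
    "encoding": {
        "x": {"field": "position", "type": "quantitative"},
        "y": {"field": "allele_fraction", "type": "quantitative"},
    },
}

# Hypothetical view 2: per-gene expression drawn as colored rectangles.
expression_view = {
    "data": {"url": "expression.tsv"},
    "mark": "rect",
    "encoding": {
        "x": {"field": "start", "type": "quantitative"},
        "x2": {"field": "end"},
        "color": {"field": "expression", "type": "quantitative"},
    },
}

# The two building blocks are combined without modifying either of them.
spec = layer(variants_view, expression_view)
print(json.dumps(spec, indent=2))
```

Even this small example runs to some thirty lines of nested structure, which is consistent with the reviewer's remark that grammar-based specifications remain fairly verbose.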

    1. Abstract. Background: Sequencing of SARS-CoV-2 RNA from wastewater samples has emerged as a valuable tool for detecting the presence and relative abundances of SARS-CoV-2 variants in a community. By analyzing the viral genetic material present in wastewater, public health officials can gain early insights into the spread of the virus and inform timely intervention measures. The construction of reference datasets from known SARS-CoV-2 lineages and their mutation profiles has become state-of-the-art for assigning viral lineages and their relative abundances from wastewater sequencing data. However, the selection of reference sequences or mutations directly affects the predictive power. Results: Here, we show the impact of a mutation- and sequence-based reference reconstruction for SARS-CoV-2 abundance estimation. We benchmark three data sets: 1) synthetic “spike-in” mixtures, 2) German samples from early 2021, mainly comprising Alpha, and 3) samples obtained from wastewater at an international airport in Germany from the end of 2021, including first signals of Omicron. The two approaches differ in sub-lineage detection, with the marker-mutation-based method, in particular, being challenged by the increasing number of mutations and lineages. However, the estimations of both approaches depend on selecting representative references and optimized parameter settings. By performing parameter escalation experiments, we demonstrate the effects of reference size and alternative allele frequency cutoffs for abundance estimation. We show how different parameter settings can lead to different results for our test data sets, and illustrate the effects of virus lineage composition of wastewater samples and references. Conclusions: Here, we compare a mutation- and sequence-based reference construction and assignment for SARS-CoV-2 abundance estimation from wastewater samples. Our study highlights current computational challenges, focusing on the general reference design, which significantly and directly impacts abundance allocations. We illustrate advantages and disadvantages that may be relevant for further developments in the wastewater community and in the context of higher standardization.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae051), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

      **Reviewer 2: Liuyang Zhao**

      In this study, the authors initiate a novel exploration by employing parameter escalation experiments to assess the impact of reference size and alternative allele frequency cutoffs on the effects of virus lineage composition in wastewater samples and their references. The research provides valuable insights into how different parameter settings influence outcomes in test data sets, particularly highlighting the role of virus lineage composition in wastewater samples and the corresponding references. Detailed parameters for these analyses are made available in several bash files at osf.io/upbqj. Despite these significant contributions, certain areas could benefit from further enhancement:

      1. The current methodology utilizes Ion Torrent for testing mock samples. However, this approach may not fully capture the variability in alignment and sub-lineage analysis. Incorporating additional sequencing data from PacBio, Nanopore, and Illumina would offer a more comprehensive examination of these aspects, potentially leading to more robust findings.

      2. While the study showcases a variety of pipelines based on mutation-based and sequence-based tools in Table 1, the evaluation of three data sets was limited to only using MAMUSS (as a mutation-based reference) and VLQ-nf (as a sequence-based reference). For more conclusive guidance in pipeline selection, it is advisable for the authors to expand their analysis to include at least two or three more pipelines. This recommendation aligns with observations noted by the authors at line 619, suggesting a comprehensive benchmark comparison would significantly enhance the study's utility and appeal to readers seeking optimal pipeline strategies.

      **Reviewer 1: Irene Bassano**

      In the manuscript "Impact of reference design on estimating SARS-CoV-2 lineage abundances from wastewater sequencing data", Aßmann et al. compare two methods, one sequence-based and one mutation-based, to better understand the circulating lineages and sub-lineages in wastewater samples. Since the advent of wastewater-based epidemiology (WBE) as a tool to complement results from clinical data, there has been a search for novel tools that can give robustness to the results and, more importantly, confidence in the data analysis. In this context, this manuscript is very important as it contributes towards achieving that goal. This is clear from the fact that they have designed a new tool, namely MAMUSS.

      1. One aspect, however, that the manuscript fails to mention is the difficulty in reconstructing full genome sequences from wastewater data. This has been one of the biggest problems since it is widely accepted that viral particles in water do degrade, and consequently what is being sequenced is a partial genome. Consensus sequences are therefore very difficult to obtain.

      2. Another aspect that the authors fail to mention in the introduction or as a point of discussion is how a variant is defined and how we take this information from clinical samples to then adopt it to define variants in environmental samples, although some relevant tools are mentioned, such as COJAC and MMMVI. Yet how these are used is not explained.

      3. The manuscript is well written, but there are some repetitive sentences that need to be removed (see comments on PDF) as well as a couple of sentences which are not grammatically correct (see comments on PDF).

      4. It is worth mentioning that the words "variants" and "lineages" are used interchangeably. I do suggest they choose one term only.

      5. The manuscript mentions several times the presence of false and true positives but does not mention how these were calculated. These need to be supported by a small statistical test.

      6. There are minor corrections throughout the manuscript that need to be addressed. All of these are highlighted as comments in the original manuscript.
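
      To illustrate the reviewer's point about true and false positives, the following is a minimal sketch of how such counts could be derived for a synthetic spike-in mixture whose composition is known. The function name, the lineage labels, the abundance values, and the `min_abundance` cutoff are illustrative assumptions; they are not taken from the manuscript, from MAMUSS, or from VLQ-nf.

```python
def classify_calls(expected, predicted, min_abundance=0.01):
    """Compare predicted lineages against a known mixture.

    expected: set of lineage names known to be in the mixture (assumed ground truth).
    predicted: dict mapping lineage name -> estimated relative abundance.
    min_abundance: hypothetical cutoff; calls below it are treated as absent.
    """
    called = {name for name, abundance in predicted.items() if abundance >= min_abundance}
    true_pos = called & expected      # called and really present
    false_pos = called - expected     # called but not in the mixture
    false_neg = expected - called     # present but not called
    precision = len(true_pos) / len(called) if called else 0.0
    recall = len(true_pos) / len(expected) if expected else 0.0
    return {
        "tp": sorted(true_pos),
        "fp": sorted(false_pos),
        "fn": sorted(false_neg),
        "precision": precision,
        "recall": recall,
    }


# Illustrative values only, not results from the study.
expected = {"B.1.1.7", "B.1.351"}
predicted = {"B.1.1.7": 0.62, "B.1.351": 0.005, "P.1": 0.30}
result = classify_calls(expected, predicted)
```

      Reporting the counts together with the cutoff used makes it clear how a change in the abundance threshold moves a lineage between the false-negative and true-positive groups.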

    1. Editors Assessment:

      RAD-Seq (Restriction-site-associated DNA sequencing) is a cost-effective method for single nucleotide polymorphism (SNP) discovery and genotyping. In this study the authors performed a kinship analysis and pedigree reconstruction for two different cattle breeds (Angus and Xiangxi yellow cattle). A total of 975 cattle, including 923 offspring with 24 known sires and 28 known dams, were sampled and subjected to SNP discovery and genotyping using RAD-Seq. This produced a panel of 7305 SNPs that captures the maximum difference between paternal and maternal genome information and can distinguish between the F1 and F2 generations with 90% accuracy. Peer review helped to better highlight the practical applications of this work. The combination of the efficiency of RAD-Seq and advances in kinship analysis here can hopefully help improve breed management, local resource utilization, and conservation of livestock.

      This evaluation refers to version 1 of the preprint

    2. Abstract. Kinship and pedigree information, used for estimating inbreeding, heritability, selection, and gene flow, is useful for breeding and animal conservation. However, as the size of the crossbred population increases, inaccurate generation and parentage recording in livestock farms increases. Restriction-site-associated DNA sequencing (RAD-Seq) is a cost-effective platform for single nucleotide polymorphism (SNP) discovery and genotyping. Here, we performed a kinship analysis and pedigree reconstruction for Angus and Xiangxi yellow cattle, which benefit from good meat quality and yields, providing a basis for livestock management. A total of 975 cattle, including 923 offspring with 24 known sires and 28 known dams, were sampled and subjected to SNP discovery and genotyping. The identified SNP panel included 7305 SNPs capturing the maximum difference between paternal and maternal genome information, allowing us to distinguish between the F1 and F2 generations with 90% accuracy. In addition, parentage assignment software based on different strategies verified the cross-assignments. In conclusion, we provided a low-cost and efficient SNP panel for kinship analyses and the improvement of local genetic resources, which are valuable for breed improvement, local resource utilization, and conservation.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.131), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Liyun Wan

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      The detailed parameters for the SNP and InDel calling should be described to allow reproduction.

      Additional Comments:

      This research provides valuable insights into the use of RAD-Seq for kinship analysis and pedigree reconstruction, which is useful for breeding and animal conservation purposes. Overall, the study is well conducted and the findings are relevant. However, there are a few aspects that require attention before the manuscript can be considered for publication. Please address the following points: 1. Provide practical applications: Highlight the practical applications of your research in livestock management, breed improvement, local resource utilization, and conservation. Discuss how the low-cost and efficient SNP panel can contribute to these areas and provide suggestions for further research or implementation. 2. Language and clarity: Review the manuscript for clarity, grammar, and sentence structure. Ensure that all key terms and concepts are defined and explained to facilitate understanding for a broad readership. Once these revisions have been made, I believe the manuscript will be much stronger and suitable for publication.

      Reviewer 2. Mohammad Bagher Zandi

      Is the language of sufficient quality?

      Yes. It was great.

      Are all data available and do they match the descriptions in the paper?

      Yes. The raw sequencing reads were deposited, but it would be better to share the SNP data as well.

      Is the data acquisition clear, complete and methodologically sound?

      No. SNP detection and SNP selection for the assignment test are not clear.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. In some cases, the materials and methods section is vague. It is better to correct them. It is mentioned in the attached manuscript text.

      Additional Comments: Well-done research, but the manuscript needs some corrections, as commented in the attached file. See: https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvNTA1L2dpZ2EtY29tZW50cy5kb2N4

    1. Editors Assessment: This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong (see https://doi.org/10.46471/GIGABYTE_SERIES_0006). This example assembles the genome of the black-faced spoonbill (Platalea minor), an emblematic wading bird from East Asia that is classified as globally endangered by the IUCN. This Data Release reports a 1.24 Gb chromosomal-level genome assembly produced using a combination of PacBio SMRT and Omni-C scaffolding technologies. BUSCO and Merqury validation were carried out, gene models created, and peer reviewers also requested MCscan synteny analysis. This showed the genome assembly had high sequence continuity with scaffold length N50 = 53 Mb. Presenting data from 14 individuals, this will hopefully be a useful and valuable resource for future population genomic studies aimed at better understanding spoonbill species numbers and conservation.

      *This evaluation refers to version 1 of the preprint*

    2. Abstract. Platalea minor, the black-faced spoonbill (Threskiornithidae), is a wading bird that is confined to coastal areas in East Asia. Due to habitat destruction, it has been classified by The International Union for Conservation of Nature (IUCN) as a globally endangered species. Nevertheless, the lack of its genomic resources hinders our understanding of their biology, diversity, as well as carrying out conservation measures based on genetic information or markers. Here, we report the first chromosomal-level genome assembly of P. minor using a combination of PacBio SMRT and Omni-C scaffolding technologies. The assembled genome (1.24 Gb) contains 95.33% of the sequences anchored to 31 pseudomolecules. The genome assembly also has high sequence continuity with scaffold length N50 = 53 Mb. A total of 18,780 protein-coding genes were predicted, and high BUSCO score completeness (93.7% of BUSCO metazoa_odb10 genes) was also revealed. A total of 6,155,417 bi-allelic SNPs were also revealed from 13 P. minor individuals, accounting for ∼5% of the genome. The resource generated in this study offers a new opportunity for studying the black-faced spoonbill, as well as carrying out conservation measures of this ecologically important spoonbill species.

      This work is part of a series of papers presenting outputs of the Hong Kong Biodiversity Genomics Consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.130), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Richard Flamio Jr.

      Is the language of sufficient quality?

      No. There are some grammatical errors and spelling mistakes throughout the text.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes. The authors did a phenomenal job at detailing the methods and data-processing steps.

      Additional Comments:

      Very nice job on the paper. The methods are sound and the statistics regarding the genome assembly are thorough. My only two comments are: 1) I think the paper could be improved by the correction of grammatical errors, and 2) I am interested in a discussion about the number of chromosomes expected for this species (or an estimate) based on related species and whether the authors believe all of the chromosomes were identified. For example, is the karyotype known, or can the researchers make any inferences about the number of microchromosomes in the assembly? Please see a recent paper I wrote on microchromosomes in the wood stork assembly (https://doi.org/10.1093/jhered/esad077) for some ideas in defining the chromosome architecture of the spoonbill and/or comparing this architecture to related species.

      Re-review:

      The authors incorporated the revisions nicely and have produced a quality manuscript. Well done.

      Minor revisions

      Line 46: A comma is needed after (Threskiornithidae).

      Line 47: “The” should not be capitalized.

      Line 48: This should read “as a globally endangered species.”

      Line 49: “However, the lack of genomic resources for the species hinders the understanding of its biology…”

      Line 56: Consider changing “also revealed” to “identified” to avoid repetition from the previous sentence.

      Line 65: Insert “the” before “bird’s.”

      Lines 69-70: Move “locally” higher in the sentence – “and it is protected locally…”

      Line 72: Replace “as of to date” with “prior to this study”.

      Lines 78-79: Pluralize “part.”

      Line 86: Replace “proceeded” with “processed.”

      Line 133: “…are listed in Table 1.”

      Line 158: “accounted”

      Line 159: “Variant calling was performed using…”

      Line 161: “Hard filtering was employed…”

      Lines 200-201: “The heterozygosity levels… from five individuals were comparable to previous reports on spoonbills – black-faced spoonbill … and royal spoonbill … (Li et al. 2022).”

      Line 202: New sentence. “The remaining heterozygosity levels observed…”

      Line 206: “…genetic bottleneck in the black-faced spoonbill…”

      Lines 208-209: “These results highlight the need…”

      Lines 213-214: “…which are useful and precious resources for future population genomic studies aimed at better understanding spoonbill species numbers and conservation.”

      Line 226: Missing a period after “heterozygosity.”

      For references, consider adding DOIs. Some citations have them but most citations would benefit from this addition.

      Reviewer 2. Phred Benham

      Is the language of sufficient quality?

      Generally yes, the language is sufficiently clear. However, a number of places could be refined and extra words removed.

      Are all data available and do they match the descriptions in the paper?

      Additional data is available on figshare.

      I do not see any of the tables that are cited in the manuscript and contain legends. Am I missing something? Also, there is no legend for the GenomeScope profile in Figure 3.

      The assembly appears to be on GenBank as a scaffold-level assembly. Can you list this accession info in the data availability section in addition to the project number?

      Is there sufficient data validation and statistical analyses of data quality?

      Overall fine, but some additional analyses would aid the paper. Comparison of the spoonbill genome to other close relatives using a synteny plot would be helpful.

      It would also be useful to put heterozygosity and inbreeding coefficients into context by comparing to results from other species.

      Additional Comments:

      Hui et al. report a chromosome-level genome for the black-faced spoonbill, an endangered species of coastal wetlands in East Asia. This genome will serve as an important resource for understanding the biology of and conserving this species.

      Generally, the methods are sound and appropriate for the generation of genomic sequence.

      Major comments: This is a highly contiguous genome in line with metrics for Vertebrate Genomics Project genomes and other consortia. The authors argue that they have assembled 31 Pseudo-molecules or chromosomes. It would be nice to see a plot showing synteny of these 31 chromosomes and a closely related species with a chromosome level assembly (e.g. Theristicus caerulescens; GCA_020745775.1)

      The tables appear to be missing from the submitted manuscript?

      Minor comments: Line 49: delete its

      Line 49-51: This sentence is a little awkward, please revise.

      Line 64: delete 'the'

      Line 67: replace 'with' with 'the spoonbill as a'

      Line 68: delete 'Interestingly'

      Line 70: can you be more specific about what kind of genetic methods had previously been performed?

      Line 79: can you provide any additional details on the necessary permits and/or institutional approval?

      Line 78: what kind of tissue? or were these blood samples?

      Line 110: do you mean movies?

      Line 143: replace data with dataset

      Line 163: it may be worth applying some additional filters in vcftools, e.g. minor allele freq., min depth, max depth, what level of missing data was allowed?, etc.

      Line 171: delete 'resulted in'

      Line 172: do you mean scaffold L50 was 8?

      Line 191-195: some context would be useful here, how does this level of heterozygosity and inbreeding compare to other waterbirds?

      Line 217: why did you use the Metazoan database and not the Aves_odb10 database for Busco?

      Figure 1b: Number refers to what, scaffolds? Be consistent with capitalization for Mb. It seems like the order of scaffold N50 and L50 were reversed.

      Figure 3 is missing a legend.

      Re-review:

      I previously reviewed this manuscript and overall the authors have done a nice job addressing all of my comments.

      I appreciate that the authors include the MCscan analysis that I suggested. However, the alignment of the P. minor assembly and annotations to other genomes suggests rampant mis-assembly or translocations. Birds have fairly high synteny and I would expect Pmin to look more similar to the comparison between T. caerulescens and M. americana in the MCscan plot. For instance, parts of the largest scaffold in the Pmin assembly map to multiple different chromosomes in the Tcae assembly. Similarly, the Z in Tcae maps to 11 different scaffolds in the Pmin assembly and there does not appear to be a single large scaffold in the Pmin assembly that corresponds to the Z chromosome.

      The genome seems to be otherwise of strong quality, so I urge the authors to double-check their MCscan synteny analysis. If this pattern remains, can you please add some comments about it to the end of the Data Validation and Quality Control section? I think other readers will also be surprised at the low levels of synteny apparent between the spoonbill and ibis assemblies.
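
      Several of the minor comments above (Line 172 and Figure 1b) turn on the difference between scaffold N50 and scaffold L50. As a reminder of how the two relate, here is a minimal sketch under the usual definitions: with scaffolds sorted from longest to shortest, N50 is the length of the scaffold at which the cumulative length first reaches half of the total assembly length, and L50 is the number of scaffolds needed to get there. The function name and the toy lengths are illustrative assumptions and are not taken from the spoonbill assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold lengths.

    N50 is a length (e.g. in bp or Mb); L50 is a count of scaffolds.
    """
    ordered = sorted(lengths, reverse=True)   # longest scaffold first
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy example: total length 300, so half is 150.
# 80 + 70 = 150 reaches the halfway point after two scaffolds.
toy_n50, toy_l50 = n50_l50([80, 70, 50, 40, 30, 20, 10])
```

      Because N50 is a length and L50 is a count, a figure panel that labels a small integer as N50 and a value in Mb as L50 most likely has the two swapped, which is the situation the reviewer describes.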

  2. Jul 2024
    1. Background: Xenopus laevis, the African clawed frog, is a versatile vertebrate model organism employed across various biological disciplines, prominently in developmental biology to elucidate the intricate processes underpinning body plan reorganization during metamorphosis. Despite its widespread utility, a notable gap exists in the availability of comprehensive datasets encompassing Xenopus’ late developmental stages. Findings: In the present study, we harnessed micro-computed tomography (micro-CT), a non-invasive 3D imaging technique utilizing X-rays to examine structures at a micrometer scale, to investigate the developmental dynamics and morphological changes of this crucial vertebrate model. Our approach involved generating high-resolution images and computed 3D models of developing Xenopus specimens, spanning from premetamorphosis tadpoles to fully mature adult frogs. This extensive dataset enhances our understanding of vertebrate development and is adaptable for various analyses. For instance, we conducted a thorough examination, analyzing body size, shape, and morphological features, with a specific emphasis on skeletogenesis, teeth, and organs like the brain at different stages. Our analysis yielded valuable insights into the morphological changes and structure dynamics in 3D space during Xenopus’ development, some of which were not previously documented in such meticulous detail. This implies that our datasets effectively capture and thoroughly examine Xenopus specimens. Thus, these datasets hold the solid potential for additional morphological and morphometric analyses, including individual segmentation of both hard and soft tissue elements within Xenopus. Conclusions: Our repository of micro-CT scans represents a significant resource that can enhance our understanding of Xenopus’ development and the associated morphological changes. The widespread utility of this amphibian species, coupled with the exceptional quality of our scans, which encompass a comprehensive series of developmental stages, opens up extensive opportunities for their broader research application. Moreover, these scans have the potential for use in virtual reality, 3D printing, and educational contexts, further expanding their value and impact.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Virgilio Gail Ponferrada (R1)

      Thanks to the authors for accommodating the reviewers' suggestions. The manuscript continues to be well constructed and easy to read. I appreciate the addition of micro-CT analysis of Xenopus gut development and the inclusion of scans of additional samples for statistical analysis bolstering their findings. Should the manuscript be accepted for publication, perhaps the authors will contact Xenbase (www.xenbase.org), the Xenopus research database, as an additional means of featuring their micro-CT datasets. I suggest this manuscript be accepted for publication.

      Reviewer name: John Wallingford (Original submission)

      Laznovsky et al. present a nice compendium of micro-CT-based digital volumes of several stages of Xenopus development. Given the prominence of this important model animal in studies of developmental biology and physiology, this dataset is quite useful and will be of interest to the community. That said, the study has some key limitations that will limit its utility for the research community, though these do not reduce the dataset's impact in the education and popular science realms, which is also a stated goal for the paper. Overall, I recommend publication after an effort has been made to address the following concerns.

      1. The atlas adequately samples developmental stages from late tadpole through metamorphosis. However, as far as I can tell only a single sample has been imaged at each stage. Thus, the quantifications of inter-stage differences shown here (Fig. 2, 4, 5) are at best very rough estimates and also provide no information about intra-stage variability in these metrics. This is not a fatal weakness, but it is an important caveat that I believe should be very explicitly stated in the text and in the figure legend of relevant figures.

      2. I am very disappointed that the rich history of microCT on Xenopus seems to have been entirely ignored by these authors. MicroCT has already been used to describe the skull, the brain, liver, blood vessels, etc. during Xenopus development. (Just a few papers the authors should read are: Slater et al., PLoS One 2009; Senevirathnea et al., PNAS, 2019; Ishii et al., Dev. Growth, Diff. 2023; Zhu et al., Front. Zool 2020.) It has also been used for comparative studies of other frogs (Kondo et al., Dev. Growth, Diff. 2022; Kraus, Anat. Rec. 2021; Jandausch et al., Zool. Anz. 2022; Paluh, et al., Evolution 2021, Paluh et al., eLife 2021). None of these, or the many other relevant papers, are discussed or cited here. The research community would be much better served if the authors made a serious effort to integrate their methods and their results into this existing literature.

      3. An opportunity may have been missed here to provide some truly new biological insights: The gut remodels substantially during metamorphosis, but to my knowledge that has NOT been previously examined by microCT. It may not work, as the gut may simply be too soft to visualize, but then again, it may be worth trying.

      Reviewer name: Virgilio Gail Ponferrada (Original submission)

      The manuscript is well written and easy to understand. It will be a good contribution to the Xenopus research community as well as a useful reference for the field of developmental and amphibian biology.

      I suggest the following revisions:

      - For the graphical abstract, try alternating NF stage numbers above and below samples for a cleaner look; adult male and adult female can both remain at the top.
      - Appreciate the rationale for providing the microCT analysis presented in this manuscript and the choice of late-stage tadpoles, pre- and prometamorphosis, through metamorphosis to the adult male and female frog.
      - For the head development section, the authors can make reference to the Xenhead drawings, Zahn et al. Development 2017.
      - Head Development section, paragraph 4: change word from "gender" to "sex."
      - Supplementary Table 3: change "gender-related" to "sex-related."
      - Micro-CT Data Analysis of Long Bone Growth Dynamics section, paragraph 1: change "in terms of gender" to "in terms of sex."
      - Figure 4 panels A and B don't reflect the observation that adult females are enlarged males. While the authors state in the caption that the views of the male and female skeletons are maximized and not proportional, I suggest that scale bars be employed and the images adjusted to show the size difference between the sexes as in Figure 1. At first glance, and perhaps especially to those not as familiar with the size difference between the sexes in Xenopus, this particular example feels misleading, because the adult male image is more spread out than the image of the female.
      - Ossification Analysis section, paragraph 2: change "frog's gender" to "frog's sex."
      - Figure 5 panel A: the label is overlapping "NF 59." For panels B and B', scale bars would help the reader understand the proportions. Yes, there is the 3 mm scale bar from panel A, as stated in the caption, but including scale bars in the B panels could help, even if panel B had a scale bar labeled at 0.25 mm and panel B' one at 3 mm.
      - Segmentation of Selected Internal Soft Organ section: perhaps more commentary on the ability to observe the development of the segmentation of the brain regions (cbh: cerebral hemispheres; cbl: cerebellum; dch: diencephalon; mob: medulla oblongata; opl: optic lobes; sp: spinal cord). While these are clearly shown in Figure 6, some accompanying description in the text would help readers in general, or could give the implication that microCT analysis of mutant or diseased frogs could help identify physical characteristics of frogs with developmental or neurological disorders. This would help transition from the analysis of a specific organ to the next section, Further Biological Potential of Xenopus's Data.
      - These analyses, while thorough and accompanied by novel visuals, require statistical implementation of multiple tadpoles and frogs per NF stage to account for variation in samples and to bolster the claims stated about skull thickness, the head mass and eye distance changes, increased length of the long bones during maturation, and femoral ossification cartilage-to-bone ratios. This may constitute a suggested major revision to perform these analyses.

    4. Background Xenopus laevis, the African clawed frog, is a versatile vertebrate model organism employed across various biological disciplines, prominently in developmental biology to elucidate the intricate processes underpinning body plan reorganization during metamorphosis. Despite its widespread utility, a notable gap exists in the availability of comprehensive datasets encompassing Xenopus’ late developmental stages.Findings In the present study, we harnessed micro-computed tomography (micro-CT), a non-invasive 3D imaging technique utilizing X-rays to examine structures at a micrometer scale, to investigate the developmental dynamics and morphological changes of this crucial vertebrate model. Our approach involved generating high-resolution images and computed 3D models of developing Xenopus specimens, spanning from premetamorphosis tadpoles to fully mature adult frogs. This extensive dataset enhances our understanding of vertebrate development and is adaptable for various analyses. For instance, we conducted a thorough examination, analyzing body size, shape, and morphological features, with a specific emphasis on skeletogenesis, teeth, and organs like the brain at different stages. Our analysis yielded valuable insights into the morphological changes and structure dynamics in 3D space during Xenopus’ development, some of which were not previously documented in such meticulous detail. This implies that our datasets effectively capture and thoroughly examine Xenopus specimens. Thus, these datasets hold the solid potential for additional morphological and morphometric analyses, including individual segmentation of both hard and soft tissue elements within Xenopus.Conclusions Our repository of micro-CT scans represents a significant resource that can enhance our understanding of Xenopus’ development and the associated morphological changes. 
The widespread utility of this amphibian species, coupled with the exceptional quality of our scans, which encompass a comprehensive series of developmental stages, opens up extensive opportunities for their broader research application. Moreover, these scans have the potential for use in virtual reality, 3D printing, and educational contexts, further expanding their value and impact.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Brian Metscher (Original submission)

      The authors present a set of 3D images of selected developmental stages of the widely-used laboratory model Xenopus laevis along with some examples of how the data might be used in developmental analyses. The dataset covers stages from mid-larva through metamorphosis to adult, which should provide a starting point for various studies of morphological development. Some studies will undoubtedly require other stages or more detailed images, but the presented data were collected with straightforward methods that will allow compatibility with future work.

      The data appear to be sound in the collection and curation. Data availability is made clear in the article, and the complete set will be publicly available in standard formats on the Zenodo repository. This should ensure full accessibility to anyone interested. The article is well-organized and clearly written.

      A few points about the methods could be clarified: Was only one specimen per stage scanned? Specimens were dehydrated through an ethanol series and then stained with free iodine in 90% methanol, and then rehydrated back through ethanol. Why was methanol used for the staining and not dehydration? It seems odd to switch alcohols back and forth without intermediate steps. This could have some effect on tissue shrinkage. It should be indicated that the X-ray source target is tungsten (even though it is unlikely to be anything else in this machine). The "real images" (p. 7) in Suppl. Fig. 1 should simply be called photographs - microCT images are real too. For the measurements of bone mass, is the cartilage itself actually visible in the microCT images? p. 13: "The dataset's diverse species representation…" What does this mean? It is only one species. The limitations on the image data are not discussed. All images have limits to their useful resolution and contrast among components; this is not a weakness, just a reality of imaging. The different reconstructed voxel sizes for different size specimens are mentioned, but it might be helpful to indicate the voxel sizes in Figure 1 as well as in the relevant table. And if the middle column of Figure 1 could be published with full resolution of the snapshots it would help show the actual quality of the images.

    1. Background Over the past few years, the rise of omics technologies has offered an exceptional chance to gain a deeper insight into the structural and functional characteristics of microbial communities. As a result, there is a growing demand for user friendly, reproducible, and versatile bioinformatic tools that can effectively harness multi-omics data to offer a holistic understanding of microbiomes. Previously, we introduced gNOMO, a bioinformatic pipeline specifically tailored to analyze microbiome multi-omics data in an integrative manner. In response to the evolving demands within the microbiome field and the growing necessity for integrated multi-omics data analysis, we have implemented substantial enhancements to the gNOMO pipeline.Results Here, we present gNOMO2, a comprehensive and modular pipeline that can seamlessly manage various omics combinations, ranging from two to four distinct omics data types including 16S rRNA gene amplicon sequencing, metagenomics, metatranscriptomics, and metaproteomics. Furthermore, gNOMO2 features a specialized module for processing 16S rRNA gene amplicon sequencing data to create a protein database suitable for metaproteomics investigations. Moreover, it incorporates new differential abundance, integration and visualization approaches, all aimed at providing a more comprehensive toolkit and insightful analysis of microbiomes. The functionality of these new features is showcased through the use of four microbiome multi-omics datasets encompassing various ecosystems and omics combinations. gNOMO2 not only replicated most of the primary findings from these studies but also offered further valuable perspectives.Conclusions gNOMO2 enables the thorough integration of taxonomic and functional analyses in microbiome multi-omics data, opening up avenues for novel insights in the field of both host associated and free-living microbiome research. gNOMO2 is available freely at https://github.com/muzafferarikan/gNOMO2.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Yuan Jiang (R1)

      The authors have fully addressed my comments.

    2. Background Over the past few years, the rise of omics technologies has offered an exceptional chance to gain a deeper insight into the structural and functional characteristics of microbial communities. As a result, there is a growing demand for user friendly, reproducible, and versatile bioinformatic tools that can effectively harness multi-omics data to offer a holistic understanding of microbiomes. Previously, we introduced gNOMO, a bioinformatic pipeline specifically tailored to analyze microbiome multi-omics data in an integrative manner. In response to the evolving demands within the microbiome field and the growing necessity for integrated multi-omics data analysis, we have implemented substantial enhancements to the gNOMO pipeline.Results Here, we present gNOMO2, a comprehensive and modular pipeline that can seamlessly manage various omics combinations, ranging from two to four distinct omics data types including 16S rRNA gene amplicon sequencing, metagenomics, metatranscriptomics, and metaproteomics. Furthermore, gNOMO2 features a specialized module for processing 16S rRNA gene amplicon sequencing data to create a protein database suitable for metaproteomics investigations. Moreover, it incorporates new differential abundance, integration and visualization approaches, all aimed at providing a more comprehensive toolkit and insightful analysis of microbiomes. The functionality of these new features is showcased through the use of four microbiome multi-omics datasets encompassing various ecosystems and omics combinations. gNOMO2 not only replicated most of the primary findings from these studies but also offered further valuable perspectives.Conclusions gNOMO2 enables the thorough integration of taxonomic and functional analyses in microbiome multi-omics data, opening up avenues for novel insights in the field of both host associated and free-living microbiome research. gNOMO2 is available freely at https://github.com/muzafferarikan/gNOMO2.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Yuan Jiang (original submission)

      Referee Report for "gNOMO2: a comprehensive and modular pipeline for integrated multi-omics analyses of microbiomes"

      This paper introduced gNOMO2, a new version of gNOMO, which is a bioinformatic pipeline for multi-omic management and analysis of microbiomes. The authors claimed that gNOMO2 incorporates new differential abundance, integration, and visualization tools compared to gNOMO. However, these new features, as well as the distinction between gNOMO2 and gNOMO, have not been clearly presented in the paper. In addition, the Methods section is written as a pipeline of bioinformatic tools, and it is not clear what these tools are used for unless one is familiar with all of them.

      My major comments are as follows:

      1. Given the existing work on gNOMO, it is critical for the authors to distinguish gNOMO2 from gNOMO to show its novelty. In the Methods section, the authors presented the six modules of gNOMO2. Are these all new relative to gNOMO, or does gNOMO include some of these functions? A clearer presentation of gNOMO2 versus gNOMO is needed.
      2. The authors did not present the methods in each module very well. For example, the authors wrote in Module 2 that "MaAsLin2 [31] is employed to determine differentially abundant taxa based on both AS and MP data. Furthermore, a joint visualization of MP and AS results is performed using the combi R package [32]. The final outputs include AS and MP based abundance tables, results from differential abundance analysis, and joint visualization analysis results." Without reading the references 31 and 32, it is very hard to understand what this module is really doing.
      3. The authors used the term "integrated multi-omics analysis" in all six modules of gNOMO2. It is not clear what this term really means. It reads as though it is not really an integrated analysis but rather a module that can handle different types of data separately, such as differential abundance analysis for each type. What other integration has been used except joint visualization? What new integration tools have been incorporated in gNOMO2?
      4. In the differential abundance analysis, does the pipeline consider the features of microbiome data, such as their count, sparsity, and compositional features? Can the modules incorporate covariates in their differential abundance analysis? It is quite useful to have covariates adjusted in a differential abundance analysis.
      5. In the Analyses section, the authors applied gNOMO2 to re-analyze samples from previously published studies. They found some discrepancy between their results and the ones in the literature. Although some discrepancy is normal, the authors need to explain better what causes the discrepancy and whether it could yield different biological conclusions.
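
Comment 4 asks whether compositional features of microbiome data are taken into account. As a minimal illustration of what a compositionality-aware step can look like, the sketch below applies a centered log-ratio (CLR) transform to the counts of one sample. It is not taken from gNOMO2 or MaAsLin2; the function name `clr`, the pseudocount of 0.5, and the example counts are assumptions chosen for illustration only.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts.

    The pseudocount (an assumed value) avoids log(0) for sparse data.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Each value is now relative to the geometric mean of the sample.
    return [value - mean_log for value in logs]

sample = [120, 0, 30, 850]  # hypothetical counts for four taxa
transformed = clr(sample)
```

By construction the transformed values of a sample sum to zero, so downstream comparisons are made relative to the sample's geometric mean rather than to its sequencing depth.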
    3. Background Over the past few years, the rise of omics technologies has offered an exceptional chance to gain a deeper insight into the structural and functional characteristics of microbial communities. As a result, there is a growing demand for user friendly, reproducible, and versatile bioinformatic tools that can effectively harness multi-omics data to offer a holistic understanding of microbiomes. Previously, we introduced gNOMO, a bioinformatic pipeline specifically tailored to analyze microbiome multi-omics data in an integrative manner. In response to the evolving demands within the microbiome field and the growing necessity for integrated multi-omics data analysis, we have implemented substantial enhancements to the gNOMO pipeline.Results Here, we present gNOMO2, a comprehensive and modular pipeline that can seamlessly manage various omics combinations, ranging from two to four distinct omics data types including 16S rRNA gene amplicon sequencing, metagenomics, metatranscriptomics, and metaproteomics. Furthermore, gNOMO2 features a specialized module for processing 16S rRNA gene amplicon sequencing data to create a protein database suitable for metaproteomics investigations. Moreover, it incorporates new differential abundance, integration and visualization approaches, all aimed at providing a more comprehensive toolkit and insightful analysis of microbiomes. The functionality of these new features is showcased through the use of four microbiome multi-omics datasets encompassing various ecosystems and omics combinations. gNOMO2 not only replicated most of the primary findings from these studies but also offered further valuable perspectives.Conclusions gNOMO2 enables the thorough integration of taxonomic and functional analyses in microbiome multi-omics data, opening up avenues for novel insights in the field of both host associated and free-living microbiome research. gNOMO2 is available freely at https://github.com/muzafferarikan/gNOMO2.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Alexander Bartholomaus (original submission)

      Summary: "gNOMO2: a comprehensive and modular pipeline for integrated multi-omics analyses of microbiomes" by Arıkan and Muth presents a multi-omics tool for the analysis of prokaryotes. It is an evolution of the first version and offers various separate modules, taking different types of input data. They present different example analyses based on already published data and reproduce the results. The manuscript is very well written (I could not detect a single typo) and it was fun to read! Well done! I have only very few comments and suggestions, see below. However, I had a problem executing the code.

      Key questions to answer: 1) Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? Yes 2) Are the conclusions adequately supported by the data shown? Yes 3) Please indicate the quality of language in the manuscript. Does it require a heavy editing for language and clarity? Very well written! 4) Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used? No direct statistics given in the manuscript. Maybe the authors could include some example output as .zip file for interested potential users.

      Detailed comments to the manuscript: Line 168: What does "cleaned and redundancies are removed" mean? Are only identical genomes removed? Or are genome parts that are identical (I guess this barely exists, except for conserved gene parts such as 16S, or similar) removed? Or are only redundant genes removed? How is redundancy defined, e.g. a 99% identical stretch? Line 399-405: When looking at Figure 5A, I am wondering how Fluviicoccus and Methanosarcina in the MP fraction appear relatively abundant in some samples. Were they de novo assembled in the MG or MT modules? General comment on figures: I know that it is a hack to deal with automatic figure generation and especially the axis labels (as names have very different lengths). However, I think some figures might be hardly visible in the printed version; especially the axis labels for panel B are very small. Maybe you can put the critical figures separately in the supplement, e.g. each B panel on one page.

      Suggestions: As suggested above, maybe the authors could include some example output (a simple example) as a .zip file for interested potential users. This would give an idea of what the output looks like and what to expect besides the plots. The differential abundance tables might be more important than the plots, as users would generate their own plots for later publications.

      GitHub and software: I also tested the software and followed the instructions on GitHub. I successfully executed the "Requirements" and "Config" steps (including creation of the metadata file and copying of amplicon reads) and tried to execute Module 1.

      However, the following error occurred (using up-to-date conda and snakemake on Ubuntu Linux):

          (snakemake) abartho@gmbs17:~/review_papers/GigaScience/gNOMO2$ snakemake -v
          6.15.5
          (snakemake) abartho@gmbs17:~/review_papers/GigaScience/gNOMO2$ snakemake -s workflow/Snakefile --cores 20
          SyntaxError in line 9 of /home/abartho/miniconda3/envs/snakemake/lib/python3.6/site-packages/smart_open/s3.py:
          future feature annotations is not defined (s3.py, line 9)
            File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/site-packages/smart_open/__init__.py", line 34, in <module>
            File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/site-packages/smart_open/smart_open_lib.py", line 35, in <module>
            File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/site-packages/smart_open/doctools.py", line 21, in <module>
            File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/site-packages/smart_open/transport.py", line 104, in <module>
            File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/site-packages/smart_open/transport.py", line 49, in register_transport
            File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/importlib/__init__.py", line 126, in import_module

      In addition to solving the problem, an example metadata file and some explanation about the output (which I did not see yet) would be good for less experienced users.
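
The `SyntaxError` reported above is the message Python gives when a module uses `from __future__ import annotations` under an interpreter older than 3.7, and the paths in the traceback point to a Python 3.6 environment. A minimal sketch of a guard that could run before such dependencies are imported is shown below. The function name `supports_future_annotations` is hypothetical and is not part of gNOMO2 or Snakemake.

```python
import sys

def supports_future_annotations(version_info=None):
    """Return True if the interpreter accepts `from __future__ import annotations` (Python 3.7+)."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= (3, 7)

# Fail early with a readable message instead of a SyntaxError deep inside a dependency.
if not supports_future_annotations():
    raise SystemExit("Python 3.7 or newer is required; found %d.%d" % sys.version_info[:2])
```

Under this reading, recreating the conda environment with a newer Python version would be the first thing to try, although the review itself does not confirm the cause.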

    1. Background As biological data increases, we need additional infrastructure to share it and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important, and in some ways has a wider scope than sharing data itself.Results Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural language search. PEPhub thus provides a fast and user-friendly way to finding existing biological research data, or to share new data.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Weiwen Wang (R1)

      The author has addressed most of my concerns, although some issues remain unresolved due to hardware and technical limitations.

    2. Background As biological data increases, we need additional infrastructure to share it and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important, and in some ways has a wider scope than sharing data itself.Results Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural language search. PEPhub thus provides a fast and user-friendly way to finding existing biological research data, or to share new data.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Weiwen Wang (original submission)

      This manuscript by LeRoy et al. introduces PEPhub, a database aimed at enhancing the sharing and interoperability of biological metadata using the PEP framework. One of the key highlights of this manuscript is the visualization of the PEP framework, which improves the adoption of the PEP framework, facilitating the reuse of metadata. Additionally, PEPhub integrates data from GEO, making it convenient for users to access and utilize. Furthermore, PEPhub offers metadata validation, allowing users to quickly compare their PEP with other PEPhub schemas. Another notable feature is the natural language search, which further enhances the user experience. Overall, PEPhub provides a comprehensive solution that promotes efficient metadata sharing, while leveraging the impact of the PEP framework in organizing large-scale biological research projects.

      While this manuscript was interesting to read, I have several concerns regarding its "semantic" search system and the interaction of PEPHub.

      1. The authors mentioned their use of a tool called "pepembed" to embed PEP descriptions into vectors. However, I was unable to locate the tool on GitHub, and there is limited information in the Methods section regarding this. Could the authors provide additional details regarding the process of embedding vectors?
      2. The authors implemented semantic search as an advantage of PEPhub. However, they did not evaluate the effectiveness of their natural language search engine, such as assessing accuracy, recall rate, or F1 score. It would be beneficial for the authors to perform an evaluation of their natural language search engine and provide metrics to demonstrate its performance. This would enhance the credibility and reliability of their claims regarding the advantages of natural language search in PEPhub.
      3. It would be more beneficial to include the metadata in the search system rather than solely relying on the project description. For instance, when I searched for SRX17165287 (https://pephub.databio.org/geo/gse211736?tag=default), no results were returned.
      4. When creating a new PEP, it appears that I can submit two samples with identical values. According to the PEP framework guidelines, it is mentioned that "Typically, samples should have unique values in the sample table index column". Therefore, the authors should enhance their metadata validation system to enforce this uniqueness constraint. Additionally, if I enter two identical values in the sample field and then attempt to add a SUBSAMPLE, an error occurs. However, when I modify one of the samples, I am able to save it successfully.
      5. The error messages should provide more specific guidance. Currently, when attempting to save metadata with an incorrect format, all error messages are displayed as: "Unknown error occurred: Unknown".
      6. PEPhub should consider providing user guidelines or examples on how to fill in subsample metadata and any relevant rules associated with it.
      7. In the Validation module, what are the rules for validation? Does it only check for the required column names in the schema, or does it also validate the content of the metadata, such as whether the metadata is in the correct format (e.g., int or string)? Additionally, it would be beneficial to provide an option to download the relevant schema and clearly specify the required column names in the schema. This would enable users to better organize their PEP to comply with the schema format and ensure that their metadata is accurately validated.
      8. This version of PEPHub primarily focuses on metadata. Have the authors considered any plans to expand this database to include data/pipeline management within the PEP framework? It would be valuable for the authors to discuss their future plans for PEPHub in this manuscript.

      Some minor concerns:

      1. When searching for content within a specific namespace, it would be beneficial for the pagination bar at the bottom of the webpage to display the number of pages. Now there are only Previous/Next buttons.
      2. As a web service, it is better to show the supported browsers, such as Google Chrome (version xxx and above), Firefox (version xxx and above). I failed to open the PEPHub website using an old version of Chrome.
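
Major concern 4 above asks that unique values be enforced in the sample table index column. A minimal sketch of such a check is given below. It is not PEPhub's or any PEP library's actual validation code; the function name `duplicate_sample_names`, the column name `sample_name`, and the example rows are assumptions made for illustration.

```python
from collections import Counter

def duplicate_sample_names(rows, index_column="sample_name"):
    """Return the index values that occur more than once in a sample table."""
    counts = Counter(row[index_column] for row in rows)
    return sorted(name for name, n in counts.items() if n > 1)

# Hypothetical sample table with one repeated index value.
rows = [
    {"sample_name": "s1", "protocol": "RNA-seq"},
    {"sample_name": "s2", "protocol": "RNA-seq"},
    {"sample_name": "s1", "protocol": "ATAC-seq"},
]

duplicates = duplicate_sample_names(rows)
if duplicates:
    print("Duplicate sample index values: " + ", ".join(duplicates))
```

Reporting the offending values by name, rather than a generic failure, would also address the request in concern 5 for more specific error messages.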

    3. Background As biological data increases, we need additional infrastructure to share it and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important, and in some ways has a wider scope than sharing data itself.Results Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural language search. PEPhub thus provides a fast and user-friendly way to finding existing biological research data, or to share new data.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Jeremy Leipzig (original submission)

      Metadata describes the who, what, where, when, and why of an experiment. Sample metadata is arguably the most important of these, but not the only type. LeRoy et al. describe a user-centric sample metadata management system with extensibility, support for multiple interface modalities, and fuzzy semantic search.

      This system and portal, PEPHub, bridges the gaps between LIMS, which are tightly bound to the wet lab, metadata fetchers like GEOfetch (from the same group) or pysradb, and public portals like MetaSRA and the others listed in . Then and both of which don't allow you to roll your own portal internally, and whose search criteria are not fuzzy or semantic.

      People have been storing metadata in bespoke databases for decades, but not in an interoperable, mature fashion. The PepHUB portal builds on some existing Pep standards by the same group, introducing a RESTful API and GUI.

      I find this paper a novel and compelling submission but would like the following minor revisions:

      1. Typically in SRA a sample refers to a DNA sample drawn from a tissue sample (i.e. BioSample), and then runs describe sequencing attempts on those DNA samples, and files are produced from each of the runs. It is unclear to me how someone working in an internal lab using PEPHub would know how to extract the file locations of sequence files associated with a sample if these are many-to-one. In the GEO example provided I can click on the SRX link to see the runs and files, but how would this work for an internally generated entry? I need the authors to explain this either as a response or in the text.
      2. I think the paper has to briefly describe how the authors envision PEPhub interfacing with or replacing a LIMS for labs that are producing their own data, and describe how it can help accelerate the SRA submission process for these data-generating labs.
      3. Change "Bernasconi2021" to META-BASE in the text.
      4. Some of the search confidence measures show an absurd level of significant digits (e.g. 56.99999999999999%). Please round that, as it's only used for sorting.
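
The last point concerns display formatting only. A minimal sketch of rounding a confidence percentage at presentation time is shown below; the function name `format_confidence` and the choice of one decimal place are assumptions, not PEPhub's implementation.

```python
def format_confidence(percent, digits=1):
    """Format a confidence percentage for display with a fixed number of decimals."""
    return f"{percent:.{digits}f}%"

print(format_confidence(56.99999999999999))  # prints "57.0%"
```

Keeping the unrounded value for sorting and rounding only when rendering avoids changing the order of search results.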

    1. Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High throughput RNA sequencing is providing large datasets capable of reflecting disease mechanisms. Clustering approaches have produced a number of tools to help dissect complex heterogeneous datasets, however, selecting the appropriate method and parameters to perform exploratory clustering analysis of transcriptomic data requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent. To address this we have developed Omada, a suite of tools aiming to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning based functions. The efficiency of each tool was tested with five datasets characterised by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements.

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Casey S. Greene (R2)

      The authors describe Omada, which is software to cluster transcriptomic data using multiple methods. The approach selects a heuristically best method from among those tested. The manuscript does describe a software package and there is evidence that the implementation works as described. The manuscript structure was substantially easier for me to follow with the revisions. The manuscript does not have evidence that the method outperforms other potential approaches in this space. It is not clear to me if this is or is not an important consideration for this journal. The form requires that I select from among the options offered. Given that this requires editorial assessment, I have marked "Minor Revision" but I do not feel a minor revision is necessary if, with the present content of the paper, the editor feels it is appropriate. If a revision is deemed necessary, I expect it would need to be a major one.

      Reviewer name: Casey S. Greene (R1)

      The authors have revised their manuscript. They added benchmarking for the method, which is important. The following overall comments still apply - there is not substantial evidence provided for the selections made:

      "I found the manuscript difficult to read. It reads somewhat like a how-to guide and somewhat like a software package. I recommend approaching this as a software package, which would require adding evidence to support the choices made. Describe the purpose for the package, evidence for the choices made, benchmarking (compute and performance), describe application to one or more case studies, and discuss how the work fits into the context.

      The evaluation includes two simulation studies and then application to a few real datasets; however, for all real datasets the problem is either very easy or the answer is unknown. The largest challenges I have with the manuscript are the large number of arbitrarily selected parameters and the limited evidence available to support those as strong choices.

      Conceptually, an alternative strategy is to consider real clusters to be those that are robust over many clustering methods. In this case, the best clusters are those that are maximally detectable with a single method. While there exists software for the former strategy, this package implements the latter strategy. It is not intuitively clear to me that this framework is superior to the other for biological discovery. It seems like general clusters (i.e., those that persist across multiple parameterizations) may be the most fruitful to pursue. It would be helpful to provide evidence that the selected strategy has superior utility in at least some settings and a description of how those settings might be identified." It is possible this is not necessary, but I simply note it as I continue to have these challenges with the revised manuscript.

      Reviewer name: Pierre Cauchy (R1)

      Kariotis et al. have efficiently addressed most reviewer comments. Omada, the tool presented there, will be of interest to the oncology and bioinformatics communities.

      Reviewer name: Casey S. Greene (original submission)

      The authors describe a system for clustering gene expression data. The manuscript describes clustering workflows (data cleaning, assessing data structure, etc).

      I found the manuscript difficult to read. It reads somewhat like a how-to guide and somewhat like a software package. I recommend approaching this as a software package, which would require adding evidence to support the choices made. Describe the purpose for the package, evidence for the choices made, benchmarking (compute and performance), describe application to one or more case studies, and discuss how the work fits into the context.

      The evaluation includes two simulation studies and then application to a few real datasets; however, for all real datasets the problem is either very easy or the answer is unknown. The largest challenges I have with the manuscript are the large number of arbitrarily selected parameters and the limited evidence available to support those as strong choices.

      Conceptually, an alternative strategy is to consider real clusters to be those that are robust over many clustering methods. In this case, the best clusters are those that are maximally detectable with a single method. While there exists software for the former strategy, this package implements the latter strategy. It is not intuitively clear to me that this framework is superior to the other for biological discovery. It seems like general clusters (i.e., those that persist across multiple parameterizations) may be the most fruitful to pursue. It would be helpful to provide evidence that the selected strategy has superior utility in at least some settings and a description of how those settings might be identified.

      I examined the vignette, and I found that it provided a set of examples. I can imagine that running this on larger datasets would be highly time-consuming. It would be helpful to add benchmarking or an estimate of compute time. Given that this seems feasible to parallelize, it might make sense to provide a mechanism for parallelization.

      I examined the software briefly. There are some comments. Dead code exists in some files. There is at least one typo in a filename (gene_singatures.R). Some of the choices that seemed arbitrary appear to be written into the software (e.g., get_top30percent_coefficients.R).

      Reviewer name: Pierre Cauchy

      Kariotis et al. present Omada, a tool dedicated to automated partitioning of large-scale, cohort-based RNA-Sequencing data such as TCGA. A great strength of the manuscript is that it clearly shows that Omada is capable of performing partitioning from PanCan into BRCA, COAD and LUAD (Fig 5), and of datasets with no known groups (PAH and GUSTO), which is impressive and novel. I would like to praise the authors for coming up with such a tool, as the lack of a systematic tool dedicated to partitioning TCGA-like expression data is indeed a shortcoming in the field of medical genomics. Overall, I believe the tool will be very valuable to the scientific community and could potentially contribute to meta-analysis of cohort RNA-Seq data. I only have a few comments regarding the methodology and manuscript. I also think that it should be more clearly stated that Omada is dedicated to large datasets (e.g. TCGA) and not differential expression analysis. I would also suggest benchmarking Omada against comparable tools via ROC curves if possible (see below).

      Methods: This section should be a bit more homogeneous between textual and mathematical description. It should specify which parts are automated and which parts need user input, and refer to the vignette documentation. I also could not find the Omada GitHub repository.

      Sample and gene expression preprocessing: To me, this section lacks methods/guidelines and only loosely describes the steps involved. "numerical data may need to be normalised in order to account for potential misdirecting quantities" - which kind of normalisation? "As for the number of genes, it is advised for larger genesets (>1000 genes) to filter down to the most variable ones before the application of any function as genes that do not vary across samples do not contribute towards identifying heterogeneity" What filtering is recommended? Top 5% variance? 1%? Based on what metric?
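
      To make the filtering question concrete, the sketch below shows one possible criterion: keeping a fixed fraction of genes ranked by sample variance. It is only an illustration of the kind of rule the methods could state. The function name `top_variable_genes`, the 5% default, and the choice of variance as the metric are assumptions for this example, not Omada's documented behaviour.

```python
import numpy as np


def top_variable_genes(expr, fraction=0.05):
    """Return sorted row indices of the most variable genes.

    expr: 2-D array-like, genes in rows and samples in columns.
    fraction: share of genes to keep (illustrative default, not from Omada).
    """
    expr = np.asarray(expr, dtype=float)
    # Sample variance of each gene across samples.
    variances = expr.var(axis=1, ddof=1)
    # Always keep at least one gene.
    n_keep = max(1, int(round(fraction * expr.shape[0])))
    # Indices ordered from highest to lowest variance.
    order = np.argsort(variances)[::-1]
    return np.sort(order[:n_keep])


expr = [
    [1.0, 1.0, 1.0],    # constant gene, variance 0
    [0.0, 10.0, 20.0],  # highly variable gene
    [5.0, 5.0, 6.0],    # nearly constant gene
    [0.0, 1.0, 2.0],    # moderately variable gene
]
print(top_variable_genes(expr, fraction=0.5))
```

      Other metrics (for example median absolute deviation or coefficient of variation) would slot into the same structure, which is why the manuscript should name the one it recommends.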
      Determining clustering potential: To me, it was not clear whether this is automatically performed by Omada and how the feasibility score is determined.

      Intra-method Clustering Agreement: Is this from normalised data? The affinity matrix will be greatly affected by whether the data are normalised or non-normalised, as it is the matrix of exponential(-normalised gene distance)^2.

      Spectral clustering step 2: "Define D to be the diagonal matrix whose (i, i)-element is the sum of A's i-th row": please also specify that A(i,j) is 0 in this diagonal matrix. Please also confirm which matrix multiplication method is used, product or Cartesian product? Also, if there are 0 values, NAs will be obtained in this step.

      Hierarchical clustering step 5: "Repeat Step 3 a total of n − 1 times until there is only one cluster left." This is a valuable addition as this merges identical clusters. The methods should emphasise the benefits of this clustering reduction method for partitioning data, i.e. that it minimises the number of redundant clusters.

      Stability-based assessment of feature sets: "For each dataset we generate the bootstrap stability for every k within range". Here it should be mentioned that this is carried out by clusterboot, and the full arguments should be given for documentation. "The genes that comprise the dataset with the highest stability are the ones that compose the most appropriate set for the downstream analysis" - is this the single highest or a gene list in the top n datasets? Please specify.

      Choosing k number of clusters: "This approach prevents any bias from specific metrics and frees the user from making decisions on any specific metric and assumptions on the optimal number of clusters.". For consistency with the cluster reduction method in the "intra-clustering agreement" section, which I believe is a novelty introduced by Omada, and within the context of automated analysis, the package should also ideally have an optimized number of k-clusters.
      K-means clustering analysis is often hindered by output containing redundant, practically identical clusters, which often requires manual merging. While I do understand the rationale described there and in Table 3, in terms of biological information and especially for deregulated genes analysis (e.g. row z-score clustering), should maximum k also not be determined by the number of conditions, i.e. 2^n, e.g. when n=2, kmax=4; n=3, kmax=8?

      Test datasets and Fig 6: Please expand on how the number of features, 300, was determined. While this number of genes corresponds to a high stability index, is this number fixed or can it be dynamically estimated from a selection (e.g. from 100 to 1000)?

      Results: Overall this section is well written and informative. I would just add the following if applicable:

      Figure 3: I think this figure could additionally include benchmarking, e.g. ROC curves of Omada vs. previous TCGA clustering analyses (PMID 31805048).

      Figure 4: I think it would be useful to compare Omada results to previous TCGA clustering analyses, e.g. PMID 35664309.

      Figure 6: swap C and D. Why is cluster 5 missing in D?
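
      The spectral clustering questions raised above (what D contains, which product is meant, and where undefined values can appear) can be illustrated with a minimal sketch. It assumes the widely used Ng-Jordan-Weiss formulation with a Gaussian affinity; whether Omada follows exactly this formulation is not confirmed by the manuscript, and the function name `normalized_affinity` and the parameter `sigma` are hypothetical.

```python
import numpy as np


def normalized_affinity(X, sigma=1.0):
    """Build A, the row sums (diagonal of D), and D^{-1/2} A D^{-1/2}.

    X: samples in rows, features in columns.
    Assumes a Gaussian affinity; this is an illustration, not Omada's code.
    """
    X = np.asarray(X, dtype=float)
    sq_norms = (X ** 2).sum(axis=1)
    # Pairwise squared Euclidean distances, clipped against rounding error.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)
    # Affinity A_ij = exp(-d_ij^2 / (2 sigma^2)), with a zero diagonal.
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # D is diagonal: entry (i, i) is the sum of row i of A, all else is 0.
    degrees = A.sum(axis=1)
    # Guard against zero row sums, which would otherwise give inf/NaN.
    inv_sqrt = np.zeros_like(degrees)
    positive = degrees > 0
    inv_sqrt[positive] = 1.0 / np.sqrt(degrees[positive])
    # Equivalent to the ordinary matrix product D^{-1/2} @ A @ D^{-1/2}.
    L = inv_sqrt[:, None] * A * inv_sqrt[None, :]
    return A, degrees, L


A, degrees, L = normalized_affinity([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(np.isfinite(L).all())
```

      Because the affinity depends on squared distances, rescaling the input changes A substantially, which is why the methods should state whether normalised data are used and how zero row sums are handled.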

      Reviewer name: Ka-Chun Wong (Original submission)

      The authors have proposed a tool to automate the unsupervised clustering of RNA-seq data. They have adopted multiple testing to ensure the robustness of the identified cell clusters. The identified cell clusters have been validated across different molecular dimensions with sound insights. Overall, the manuscript is well-written and suitable for GigaScience in 2023. I have the following suggestions:

      1. It is very nice for the authors to have released the tool in BioConductor. I was wondering if the authors could also highlight it at the end of the abstract, similar to the Oxford Bioinformatics style? It could attract citations.

      2. The authors have spent significant effort on validating the identified clusters from different perspectives. However, there are many similar toolkits. Comparisons with them in terms of running time, user-friendliness, and memory requirements would be essential.

      3. Since the submitting journal is GigaScience, running time analysis could be necessary to assess the toolkit's scalability in the context of big sequencing data.

      4. Single-cell RNA-seq data use cases could also be considered in 2023.

    1. Editors Assessment:

      Oxford Nanopore direct RNA sequencing (DRS) is a relatively new sequencing technology enabling measurements of RNA modifications. In vitro transcription (IVT)-based negative controls (i.e. modification-free transcripts) are a practical and targeted control for this direct sequencing, providing a baseline measurement for canonical nucleotides within a matched and biologically-derived sequence context. This work presents exactly this type of long-read, multicellular, poly-A RNA-based, IVT-derived, unmodified transcriptome dataset. Review flagged that more statistical analyses of data quality were needed, and these were provided. The resulting data provide a resource to the direct RNA analysis community, helping reduce the need for expensive IVT library preparation and sequencing for human samples, and also serve as a framework for RNA modification analysis in other organisms.

      This evaluation refers to versions 1 and 2 of the preprint

    2. ABSTRACT Nanopore direct RNA sequencing (DRS) enables measurements of RNA modifications. Modification-free transcripts are a practical and targeted control for DRS, providing a baseline measurement for canonical nucleotides within a matched and biologically derived sequence context. However, these controls can be challenging to generate and carry nanopore-specific nuances that can impact analysis. We produced DRS datasets using modification-free transcripts from in vitro transcription (IVT) of cDNA from six immortalized human cell lines. We characterized variation across cell lines and demonstrated how these may be interpreted. These data will serve as a versatile control and resource to the community for RNA modification analysis of human transcripts.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.129), and has published the reviews under the same license. These reviews are as follows:

      Reviewer 1. Joshua Burdick

      Is the language of sufficient quality?

      Yes. In line 284, "bioinformatic" may be more often used than "BioInformatic", but the meaning is clear.

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      Yes. Presumably the files (e.g. eventalign data) which are not in SRA will need to be uploaded to the GigaByte site.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes. Line 177 should presumably be "nanopolish eventalign".

      Is there sufficient data validation and statistical analyses of data quality?

      Yes. In my opinion, Figure 3(A) nicely illustrates the uncertainty in current nanopore data, which is useful.

      Additional Comments:

      The RNA samples, and nanopore sequencing data, should be useful as a negative control. Sequencing these IVT RNA samples using the newer ONT RNA004 pore and kit might also be useful.

      Reviewer 2. Jiaxu Wang

      Is there sufficient data validation and statistical analyses of data quality?

      No. The authors ran DRS on in vitro transcribed RNAs from 6 cell lines to remove possible natural modifications. The data can be used as a control RNA pool for natural or artificial modification studies. However, more statistical analyses should be performed for the data quality. See comments below:

      (1) To enable broader use of these data, some QC analysis should be provided to confirm the quality of the sequencing data. For example: 1) What is the correlation between the in vitro transcribed RNAs and the original DRS for each cell line? 2) How many genes have been captured in each cell line?

      (2) In Figure 2B, the authors provide 3 conditions for ‘exclude’ and ‘include’. Some statistical analysis should be performed to confirm how many cases fall under condition 1, condition 2, and condition 3. How many mismatches appear in only 1 cell line, in some cell lines, or in all the cell lines? The shared correct genes may be more confident references for the modification analysis.

      (3) Different reads of the same gene could have different mismatches in the IVT RNAs due to RT-PCR bias or other reasons (especially for the lower expressed RNAs). For example, if there are 100 reads in total, 90 reads have the correct nucleotide at a given position, and 10 reads have a mismatch in the IVT sample, how should the signal be defined as the control reference? Given that natural modification is low in RNA, some threshold should be applied for a confident result. For example, what is the lowest expression threshold that could be used as a confident control reference?
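
      The kind of threshold asked for in comment (3) could be expressed along the following lines. This is a sketch only: the function name `is_confident_reference` and both cut-offs (a minimum of 20 reads and a maximum mismatch fraction of 5%) are hypothetical values chosen for illustration, not values proposed by the authors or the reviewer.

```python
def is_confident_reference(total_reads, mismatch_reads,
                           min_coverage=20, max_mismatch_fraction=0.05):
    """Decide whether a position in the IVT control is a usable reference.

    Both thresholds are illustrative assumptions, not published values.
    """
    # Positions with no or too few reads cannot support a confident call.
    if total_reads <= 0 or total_reads < min_coverage:
        return False
    # Fraction of reads disagreeing with the reference nucleotide.
    mismatch_fraction = mismatch_reads / total_reads
    return mismatch_fraction <= max_mismatch_fraction


# The reviewer's example: 100 reads, 10 of which carry a mismatch.
print(is_confident_reference(100, 10))
print(is_confident_reference(100, 10, max_mismatch_fraction=0.2))
```

      With the assumed 5% cut-off the example position is rejected, while a looser 20% cut-off accepts it, which shows why the chosen thresholds need to be reported alongside the data.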

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      No. For more possible usage of this data, more QC data should be performed, please refer to my above comments.

      Re-review: I am happy to see the changes. Thanks!

    1. Editors Assessment:

      This paper presents a new tool to make it easier to use PhysiCell, an open-source, physics-based multicellular simulation framework with a very wide user base. PhysiCell Studio is a graphical tool that makes it easier to build, run, and visualize PhysiCell models. Over time, it has evolved from being a GUI to include many additional functionalities, and it is available in desktop and cloud versions. This paper outlines the many features and functions, the design and development process behind it, and deployment instructions. Peer review improved the organisation of the various repositories and led to the addition of both requirements.txt and environment.yml files. Looking to the future, the developers are planning to add new features based on community feedback and contributions, and this paper presents the many code repositories for readers who wish to contribute to the development process.

      This evaluation refers to version 1 of the preprint

    2. Abstract Defining a multicellular model can be challenging. There may be hundreds of parameters that specify the attributes and behaviors of objects. Hopefully the model will be defined using some format specification, e.g., a markup language, that will provide easy model sharing (and a minimal step toward reproducibility). PhysiCell is an open source, physics-based multicellular simulation framework with an active and growing user community. It uses XML to define a model and, traditionally, users needed to manually edit the XML to modify the model. PhysiCell Studio is a tool to make this task easier. It provides a graphical user interface that allows editing the XML model definition, including the creation and deletion of fundamental objects, e.g., cell types and substrates in the microenvironment. It also lets users build their model by defining initial conditions and biological rules, run simulations, and view results interactively. PhysiCell Studio has evolved over multiple workshops and academic courses in recent years which has led to many improvements. Its design and development has benefited from an active undergraduate and graduate research program. Like PhysiCell, the Studio is open source software and contributions from the community are encouraged.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.128), and has published the reviews under the same license. This is part of the PhysiCell Ecosystem Series: https://doi.org/10.46471/GIGABYTE_SERIES_0003

      Reviewer 1. Meghna Verma:

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

      The authors have provided links for video descriptions for installation and that is appreciated.

      One overall recommendation: if all the screenshots (e.g., Figs 1-12 of the main paper and all the subsections in the Supplementary) could be combined into one figure, that would give a more complete overview and improve the overall flow of the paper.

      Additional comments are available here: https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvVFIvNTA3L1Jldmlld19QaHlzaUNlbGxTdHVkaW9fTVYucGRm

      Reviewer 2. Koert Schreurs and Lin Wouters supervised by Inge Wortel

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?

      The problem statement is addressed in the introduction, which mentions the need for a GUI tool as a much more accessible way to edit the XML-based model syntax. However, it is somewhat confusing who exactly the intended audience of the paper is. Is the paper targeted at researchers that already use PhysiCell, but might want to switch to the GUI version? Or should it (also) target the potential new user-base of researchers interested in using ABMs, for whom the XML version was not sufficiently accessible and who will now gain access to these models because there is a GUI? Specifying the intended audience might impact some sections of the paper. For example, for users who already use PhysiCell, the step-by-step tutorials might not be useful since they would already know most of the available options; they would just need a quick overview of what info is in which tab. But if the paper is (also) targeted at potential new users, then some additional information could make both the paper and the tool much more accessible, such as:
      
      • A clear comparison to other modeling frameworks and their functionalities. Why should they use PhysiCell instead of one of the other available (GUI) tools? For example, the referenced Morpheus, CC3D and Artistoo all focus on a different model framework (CPMs); this might be worth mentioning. And what about Chaste? Does it represent different types of models, or are there other reasons to consider PhysiCell over Chaste or vice versa? For new users, this would be important information to include. The paper currently also does not mention other frameworks except those that offer a GUI. While the main point of the paper is the addition of the GUI, for completeness' sake it might still be good to give a broader overview of ABM frameworks and how they compare to PhysiCell, or simply to refer to an existing paper that provides such an overview.
      • The current tutorial immediately dives into very specific instructions (what to click and exact values to enter), often without explaining what these options mean or do. New users would probably appreciate getting a rough outline of which types of processes can be modelled, and which steps they would take to do so. This could be as easy as summarising the different main tabs before going into the details. I understand that some of these explanations will overlap with the main PhysiCell software – but considering that the GUI will open up modelling to a different type of community, it might make sense to outline them here to get a self-contained overview of functionality.
      • Indeed, if the above information is provided, the detailed tutorial might fit better as an appendix or in online documentation. That would also leave more space to explain not only which values to enter, but also what these variables do, why choose these values, what other options to consider, etc. Having this information together in one place would be very useful for beginning users.

      Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code?

      The software is available under the GPL v3 licence.

      As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?

      There is a Github repository, ensuring that it is possible to contribute and report issues, and the paper explicitly invites community contributions. However, although the paper mentions that it is possible to seek support through Github Issues and “Slack channels”, we could find no link to the latter resource. This should probably be added to make this resource usable for the reader (or otherwise the statement should be removed).

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

      Mostly yes, as installation and deployment are outlined in the paper and documentation. However, we did notice a couple of issues:
      • The studio guide explains how to compile a project in PhysiCell (https://github.com/PhysiCell-Tools/Studio-Guide/blob/main/README.md), but does not mention that Mac users need to specify the g++ version at the top of the Makefile. This is explained in a separate blog (http://www.mathcancer.org/blog/setting-up-gcc-openmp-on-osx-homebrew-edition/) but should be outlined (or at least referenced) here as well.
      • There are several different resources covering the installation process, referring to e.g. github.com/physicell-training, github.com/PhysiCell-Tools/Studio-Guide, and the abovementioned blog. But this might not be very accessible to all users targeted by the new GUI functionality (especially when command line interventions and manual Makefile edits are involved). While not all of this has to be changed before publication, having all information in one place would already improve accessibility to a larger user-base.
      • When following the instructions (https://github.com/PhysiCell-Tools/Studio-Guide/blob/main/README.md), “python studio/bin/studio.py -p -e virus-sample” the -p flag gives an error: “Invalid argument(s): [‘-p’]”. We assumed it has to be left out, but perhaps the docs have to be updated.

      Is the documentation provided clear and user friendly?

      Mostly yes, as there is already a lot of documentation available. However, the user-friendliness could be improved with some minor changes. For example, the documentation could be made more user-friendly if resources were available from a central spot. Currently, information can be found in different places:
      • https://github.com/PhysiCell-Tools/Studio-Guide/blob/main/README.md provides installation instructions and a nice overview of what is where in the GUI, but as mentioned above, does not mention potential issues when installing on MacOS.
      • The paper provides very detailed examples; these might be nice to include along with the abovementioned overview.
      • Potentially other places as well.

      It would be great if the main documentation page could at least link to these other resources with a brief description of what the user will find there. Further, some additions would make the documentation more complete:
      • It would be good to have an overview somewhere of all the configuration files that can be supplied/loaded (e.g. those for “rules” and for initial configurations).
      • A clearer instruction/small tutorial on how to use simularium and paraview with PhysiCell Studio; especially for paraview there is no instruction on how to use your own data or make your own `.pvsm` file.

      In the longer term, it might be worthwhile to set up a self-contained documentation website (this is relatively easy nowadays using e.g. GitHub Pages), which can outline dependencies, installation instructions, a quick overview, detailed tutorials, example models, and links to GitHub issues/Slack communities. This is not a requirement for publication but might be worth looking into in the future as it would be more user-friendly.
      

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?

      No. The core functionality of the software is nicely outlined in the Github README (https://github.com/PhysiCell-Tools/Studio-Guide/blob/main/README.md), but as mentioned before, this high-level overview is missing in the paper itself. The README and paper recommend installing the Anaconda python distribution to get the required python dependencies. This is fine, but adding a setup file or requirements.txt might still be useful for users who are more familiar with python and want a more minimal installation. Providing a conda environment.yml that allows running the studio along with paraview and/or simularium might also be helpful. Note that running the studio with simularium in anaconda did not work because anaconda did not have the required vtk v9.3.0; instead we had to install simularium without anaconda (“pip3 install simularium”).

      Are there (ideally real world) examples demonstrating use of the software?

      The detailed tutorial nicely walks the reader through the tool (although, as mentioned before, a high-level overview is missing and the level of detail feels slightly out of place in the paper itself). When walking through the example in the paper and the supplementary, we did run into a few (minor) issues:
      • It might be good to stress explicitly that after copying the template.xml into tumor_demo.xml, the first step is always to compile using “make”. The paper mentions “Assuming … you have compiled the template project executable (called “project”) …”. But it might not be immediately clear to all users how exactly they should do so (presumably by running “make tumor_demo” after copying the xml file?).
      • When running “python studio/bin/studio.py -c tumor_demo.xml -e project” as instructed, a warning pops up that “rules0.csv” is not valid (although the tool itself still works).
      • The instructions for plotting say to press “enter” when changing cmin and cmax, but Mac offers only a return key. Pressing fn+return to get the enter functionality also does not work; it might be good to offer an alternative for Mac.
      • When reproducing the supplementary tutorial, results were slightly different. It might be good if the example offered a random seed so that users can verify that they can reproduce these results exactly. In our hands, reproducing figs 39, 40, 48 and 49 yielded many more (red) macrophages (even when running multiple times), but we could not be sure whether this is due to variation between runs or a mistake in the settings somewhere.
      
      
      The paper mentions that they have started setting up automated testing, but it does not give an idea of what the current test coverage is. Did they add a few tests here and there, or start to systematically test all parts of the software? I understand the latter might not be achievable immediately, but it would be good if users and/or contributors can at least get a sense of how good the current coverage is. (Note: the framework uses pytest, which seems to offer some functionality to generate coverage reports, see e.g. https://www.lambdatest.com/blog/pytest-code-coverage-report/). The code in studio_for_pytest.py has a comment “do later, otherwise problems sometimes”, but it is not entirely clear if the relevant issue has been resolved.
      

      Additional Comments: The presented tool offers a GUI for the PhysiCell framework for agent-based modeling. As outlined in the paper, this offers significant value to the users since editing a model is now much more accessible. The tool comes with extensive functionality and instructions. Overall, the tool functions as advertised, and will be of great value to the community of PhysiCell users who currently have to edit XML files by hand. It is therefore (mostly) publishable as is if some of the issues with installation (mentioned above) can be straightened out. That said, we do think some improvements could make both the tool and the paper more accessible to a larger user audience. Most of these have been mentioned in the other questions, but we will list some additional ones below. Note that many of these are just suggestions, so we will leave it up to the authors if and when they implement them.

      Suggestions for the paper: While the paper nicely outlines design ideas and usage of the tool, there were some points where we felt that the main point did not quite come across, for example:
      • As mentioned in the question about problem statement and intended audience, adding some information to the paper would make it a more useful resource to users not yet familiar with PhysiCell (see remarks there).
      • The section “Design and development” describes the development history of the tool. In principle this is a valuable addition, because it illustrates how the project is under ongoing development and has already been improved several times based on feedback of users. However, the amount of information on each previous stage is slightly confusing; it is not entirely clear how this relates to the paper and current tool. If the main point is to showcase that the current tool has been built based on practical user experiences, this would probably come across better if this section was somewhat shorter and focused on the design choices rather than previous versions. If the main point is something else, it should be clarified what the main idea is.
      • The point of Table 1 was unclear to us – consider removing or explaining the main idea.
      • Several figures do not have captions (e.g. Figure 1 but also others); it would be helpful to clarify what message the figure should convey.
      • P4 “adjust the syntax for Windows if necessary” – is it self-explanatory how users should adjust? Consider adding the correct code for Windows as well if possible, since users that want to use the GUI tool might not be familiar with command line syntax.
      • P6 “if you create your own custom C++ code referring directly to cell type ID” – this functionality is never discussed. This might be part of the general PhysiCell functionality, but it would be good to at least provide a link to a resource on how you could do this.
      • P8 “Only those parameters that display … editing the C++ code” – it was not entirely clear to me what this means, could you clarify?
      • P13 mentions you can immediately see changes to the model parameters made. This is very useful for prototyping when users want immediate feedback. However, what happens when you try to save output for a simulation where parameters were changed while the simulation was running? Would users be reminded that their current output is not representative?
      • Discussion: it is good to mention that the tool is already being used. Can you give an indication based on your experience how long it takes new users to learn to navigate the tool? This might be useful information to add in the paper.
      • The last statement on LLMs seems to come out of nowhere. Consider leaving it out or expanding further on what would be needed to make this work/how feasible this is.

      Further comments on the tool itself:
      • The paper mentions that results may not be fully reproducible if multiple threads are used (I assume this is the case even when a random seed is set). In this case, would it make sense to throw a warning the first time a user tries to set a seed with multiple threads, to avoid confusion as to why the results are not reproducible?
      • Unusable fields are not always greyed out to indicate that they are disabled, which sometimes makes it seem as though the tool is unresponsive. In other places unusable options are set to grey, so it might be good to double-check if this is consistent.
      • At the initial conditions (IC) page there is no legend; it might be good to add one.
      • There are some small inconsistencies between the field names mentioned in the paper and those in the tool/screenshots. For example “boundary condition” (p5) should be “dirichlet BC”, “uptake” (p6) should be “uptake rate”. For the latter, the paper mentions that the length scale is 100 micron but this should be visible in the tool as well.
      • Not all fields have labels, so it is not always clear what the options do (see e.g. drop-downs in Figure 6).
      • There are a few points in the tool where you have to “enable” a functionality before it works, but this might not always be intuitive. For example, if you upload a file with initial conditions, it can be assumed that you want to use it. There might be good reasons for this in some cases but in general, consider if all these checkpoints are necessary or if this could be simplified. The same goes for the csv files that have to be saved separately instead of through the main “save” button – in the long term it might be worth saving all relevant files when they are updated, or at least throwing a warning that you have to save some of them separately.
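
      The suggestion to warn when a random seed is combined with multiple threads could look roughly like the minimal Python sketch below. The function name `check_reproducibility` and its parameters `random_seed` and `omp_num_threads` are hypothetical and are not taken from the PhysiCell Studio code; the sketch only illustrates the idea of warning once at the point where the settings are chosen.

```python
import warnings


def check_reproducibility(random_seed, omp_num_threads):
    """Return True if the settings can be expected to give reproducible runs.

    Hypothetical helper: `random_seed` is None when no seed was set, and
    `omp_num_threads` is the number of threads requested for the simulation.
    """
    if random_seed is None:
        # No seed at all: runs differ by design, nothing to warn about.
        return False
    if omp_num_threads > 1:
        # A seed was set, but multi-threaded runs may still differ,
        # so tell the user instead of failing silently.
        warnings.warn(
            "A random seed is set, but results may not be exactly "
            "reproducible when more than one thread is used.",
            UserWarning,
            stacklevel=2,
        )
        return False
    return True
```

      A GUI would more likely surface this as a dialog or status-bar message than as a Python warning, but the check itself would be the same.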

    1. Editors Assessment:

      Many studies have explored the genetic determinants of COVID-19 severity, with these GWAS studies using microarrays or expensive whole-genome sequencing (WGS). Low-coverage WGS data can be imputed using reference panels to enhance resolution and statistical power while maintaining much lower costs, but imputation accuracy is difficult to balance. This work demonstrates how to address these challenges utilising the GLIMPSE1 algorithm, a less resource-intensive tool that produces more accurate imputed data than its predecessors, generating a dataset containing 79 imputed low-coverage WGS samples from patients with severe COVID-19 symptoms during the initial wave of the SARS-CoV-2 pandemic in Spain. The validation of this imputation and filtering process shows that GLIMPSE1 can be confidently used to impute variants with minor allele frequency as low as approximately 2%. After peer review the authors clarified their approach and provided more validation, statistics and figures to help convince readers that it was valid. This work showcases the viability of using low-coverage WGS imputation to generate data for the study of disease-related genetic markers, alongside a validation methodology to ensure the accuracy of the data produced. It should help inspire confidence and encourage others to deploy similar approaches to other infectious diseases, genetic disorders, or population-based genetic studies, particularly in large-scale genomic projects and resource-limited settings where sequencing at higher coverage could prove to be prohibitively expensive.

      This evaluation refers to version 1 of the preprint

    2. Abstract: Despite advances in identifying genetic markers associated with severe COVID-19, the full genetic characterisation of the disease remains elusive. This study explores the use of imputation in low-coverage whole genome sequencing for a severe COVID-19 patient cohort. We generated a dataset of 79 imputed variant call format files using the GLIMPSE1 tool, each containing an average of 9.5 million single nucleotide variants. Validation revealed a high imputation accuracy (squared Pearson correlation ≈0.97) across sequencing platforms, showing GLIMPSE1’s ability to confidently impute variants with minor allele frequencies as low as 2% in Spanish ancestry individuals. We conducted a comprehensive analysis of the patient cohort, examining hospitalisation and intensive care utilisation, sex and age-based differences, and clinical phenotypes using a standardised set of medical terms developed to characterise severe COVID-19 symptoms. The methods and findings presented here may be leveraged in future genomic projects, providing vital insights for health challenges like COVID-19.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.127 ), and has published the reviews under the same license. For a video summary from the author see: https://youtu.be/x6oVzt_H_Pk?si=Byufhl0mIL3h0K6u

      The reviews are as follows:

      Reviewer 1. Jong Bhak:

      Severe cases of COVID-19 patients are critical data. This manuscript deals with detailed clinical information genome set as a subset of exome sequences and provides invaluable data for ongoing global COVID-19 omics studies.

      Reviewer 2. Alfredo Iacoangeli:

      The authors present the release of a new dataset that includes low coverage WGS data of 79 individuals who experienced severe COVID-19 in Madrid (Spain). The authors processed the data and imputed common variants and they are making this dataset available to the scientific community. They also present the clinical data of these patients in a descriptive and informative fashion. Finally, the authors also validated the quality of their imputation, showcasing the potential of low coverage WGS as an alternative to microarrays. Overall the manuscript is written very well, clear, and exhaustive. The data is certainly valuable. Its generation, processing and analysis appear robust.
      

      Overall I support the publication of this article and dataset. I only have a small number of minor suggestions for the authors: The sentence "Traditionally, the genotyping process has relied on array technologies as the standard, both at the broader GWAS level and the more specific genetic scoring and genetic diagnostics levels" sounds a little off. I totally understand where the authors come from, but given the central role of NGS and Sanger for genetic diagnostics I would suggest that the authors modify it accordingly or keep the GWAS focus.

      Please double-check the use of statistical terms in the description of the imputed data. For example: "On average, each VCF file in this rich dataset contains 9.49 million high-confidence single nucleotide variants [95%CI: 9.37 million - 9.61 million] (Figure 1)." The use of CI in this context is a little misleading as it is not strictly referring to a distribution of probability but to a finite collection. A range would be more appropriate. The authors say that they examined the ethnicity of the 79 individuals; however, I do not think the ancestry is actually reported anywhere, while a few figures show ancestral population data. The authors might clarify or correct the terminology.
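
      To illustrate the distinction between a confidence interval and a range, the following minimal Python sketch computes both for a small list of per-sample variant counts. The values are made up for illustration and are not taken from the dataset; the function name `summarize` is hypothetical, and the interval uses a simple normal approximation (1.96 standard errors), which is an assumption rather than the authors' method.

```python
import math
import statistics


def summarize(counts):
    """Summarise per-sample variant counts (e.g. in millions of SNVs).

    Returns the mean, a normal-approximation 95% CI for the mean,
    and the plain min-max range of the observed values.
    """
    n = len(counts)
    mean = statistics.mean(counts)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(counts) / math.sqrt(n)
    half_width = 1.96 * sem
    return {
        "mean": mean,
        "ci95_of_mean": (mean - half_width, mean + half_width),
        "range": (min(counts), max(counts)),
    }


# Hypothetical counts, in millions of variants per VCF file.
example = summarize([9.1, 9.4, 9.5, 9.6, 9.9])
```

      The interval describes uncertainty about the mean and narrows as more samples are added, whereas the range describes the spread of the files actually in the collection, which is what a reader of a data release usually wants to know.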

      Looking at Figure 2, the sentence "although the male age distribution exhibits a broader range and higher variability, suggestive of a greater" does not appear justified. The authors might want to clarify or correct accordingly.

      The sentence "This exploratory analysis highlights the diverse ways in which severe COVID-19 can present, and the importance of comprehensive and nuanced clinical phenotyping in improving our understanding and management of the disease." suggests some basic clustering might be useful. The readers might benefit from a couple of graphs or figures quantifying the overlap of the SNPs across samples and maybe one that shows the density of SNPs across the genome.

    1. this pathogen, coinciding with a progressive shrinking of the degradative arsenal and expansions in lineage specific genes. Comparative transcriptomics of four reference species with different evolutionary histories and adapted to different hosts revealed similarity in gene content but differences in the modulation of their transcription profiles. Only a few orthologs show similar expression profiles on different plant cell walls. Combining genome sequences and expression profiles we identified a set of core genes, such as specific transcription factors, involved in plant cell wall degradation in Colletotrichum. Together, these results indicate that the ancestral Colletotrichum were associated with eudicot plants and certain branches progressively adapted to different monocot hosts, reshaping part of the degradative and transcriptional arsenal.

      Reviewer 2: Nicolas Lapalu

      This manuscript describes the adaptation of the Colletotrichum genus to monocotyledonous and dicotyledonous plants with regard to the content and expression of genes from 30 genomes, with a subsampling of 4 genomes for transcriptomic analyses.

      Major remarks: "Considering that the analyses carried out are affected by the sampling, as closely related species are likely to have more shared genes compared to species that are more distant from others," Yes, indeed, it is clearly a possible bias due to the sampling, as you write. As you considered all genomes together to define specific genes, monocot specific species have few specific genes due to their phylogenetic proximity. Based on this, could you address these observations based on the combination of Figures 1 and 2: 1. The number of specific genes in C.eremochloae (1608) vs in C.sublineola (1643), while divergence time between both seems short and similar to the group of C.lupini, C.costaricense (monocot) … with approximately 100 genes specific to each species. How could such closely related genomes have acquired so many specific genes in such a short time compared with other species during the same period of evolution? 2. Same remark for C.phormii (911) vs C.salicis (286), where it is even more disturbing with the switch to dicot and a loss of many genes for C.salicis. For both cases mentioned above, a detailed comparison between the two genomes could be useful to obtain some explanations of the events and genes involved. Moreover, interpretation of the phylogenetic tree (Figure 1) could lead one to propose three clusters of genomes, based on evolution time and plant host: Monocot, Dicot "old" (C.orbiculare, C.noveboracense, …) and Dicot "young" (C.melonis, C.cuscutae, …). Did the authors attempt an analysis with such a view of the data? Maybe that will complete the view of C.acutatum complex (46 genes) vs C.graminicola complex (28 genes) from which C.orchidophylum and C.phormii are excluded. Finally, one of the most interesting things is the proximity of C. phormii and C.salicis in the same clade but with a recent host specialization. Despite the poor quality of the genome of C.salicis vs C.phormii, a comparative genomic approach with a tool like Synchro could provide clues as to gene losses and their location (all along the genome/specific regions).

      Figure 3: Please explain further Figure 3 A, described as a PCA. No axis (dimension) has been shown with a % explaining the divergence between organisms. This is confusing and does not allow me to know whether the gene sets used to compare the 4 genomes are only shared genes or all genes. The rest of the figure is much clearer and the comments are clear on the response to species specificity (under/over expression of genes) for each genome.

      Figure 4: "the expression of the orthologous genes was clustered for the four fungal species (Figure 4A)" As written, it is assumed that you used ortholog genes established between the 4 species; this does not appear to be the case with so many genes missing in C.graminicola in Figure 4. To continue on this point, I have not found the minimum number of species found in a cluster to set a cluster of orthologs (maybe written but not found). What is the threshold for divergence or sequence similarity? Have you considered sequence length (query coverage vs subject coverage) to allow clustering of potentially split/fragmented genes in annotations?

      Minor remarks: The authors limit their analysis to 30 genomes, whereas more than 270 genomes of Colletotrichum are available, from over 70 species. Research time is clearly longer than the time to generate genomic resources, but it could be interesting to list a few new genomes missing from those analysed that could have significant added value (particularly if sequenced in long reads, providing complete genomes). Transcriptomic analyses were carried out on 4 genomes. The choice of the genomes was not discussed, and was maybe made for convenience with strains available at the lab. In fact, C.higginsianum is well sequenced, assembled and studied and chosen as one of the specific hosts of dicotyledons, whereas it is a member of the C.destructivum complex. Similarly, C.phormii appears to be a recent species with an adaptation to monocots. L 113: "species with bigger genomes are characterized by a lower GC content", please rewrite the link between genome size and GC content. Between species of the same genus, genome size is most often linked to the invasion of TE elements (RIPed or not in fungi). Strongly RIPed genomes (Leptosphaeria, Venturia) are not always large compared to the size of other species.

      Data availability: All genomes were released in public databases. I do not find accession numbers for RNA-Seq runs. Many supplementary details have been provided. I appreciate the BUSCO logs for checking the completeness of gene sets, which provide me with some clues about the quality of genome annotation, which was never discussed or pointed out in the manuscript as a possible source of bias.

      Overall, the manuscript is very interesting and confirms the results previously identified in terms of specificities of CAZy families associated with host plant adaptation in the Colletotrichum genus. The authors demonstrate a great knowledge of the CAZome and associated biological processes, which provides a great deal of valuable information for the community working on Colletotrichum and more generally for all those working on such enzymes. Finally, the transcriptomic data suggest that species specificity and host adaptation are more related to an expression pattern than to a specific gene content.

    2. Colletotrichum fungi infect a wide diversity of monocot and eudicot hosts, causing plant diseases on almost all economically important crops worldwide. In addition to its economic impact, Colletotrichum is a suitable model for the study of gene family evolution on a fine scale to uncover events in the genome that are associated with the evolution of biological characters important for host interactions. Here we present the genome sequences of 30 Colletotrichum species, 18 of them newly sequenced, covering the taxonomic diversity within the genus. A time-calibrated tree revealed that the Colletotrichum ancestor diverged in the late Cretaceous around 70 million years ago (mya) in parallel with the diversification of flowering plants. We

      Reviewer 1: Jamie McGowan In this study, Baroncelli and colleagues carry out a comprehensive analysis of genomic evolution in Colletotrichum fungi, an important group of plant pathogens with diverse and economically significant hosts. Their comparative genomic and phylogenomics analyses are based on the genome sequences of 30 Colletotrichum species spanning the diversity of the genus, including pathogens of dicots, monocots, and both dicots and monocots. This includes 18 genome sequences that are newly reported in this study. They also perform comparative transcriptomic analyses of 4 Colletotrichum species (2 dicot pathogens and 2 monocot pathogens) on different carbon sources. Overall, I thought the manuscript was very well written and technically sound. The results should be of interest to a broad audience, particularly to those interested in fungal evolutionary genomics and plant pathology. I only have a few minor comments. Minor comments: (1) Lines 50 - 51: "The plant cell wall (PCW) consists of many different polysaccharides that are attached not only to each other through a variety of linkages providing the main strength and structure for the PCW". I found this confusing - is the sentence incomplete? (2) Line 66: "Some Colletotrichum species show…" I think there should be a couple of introductory sentences about Colletotrichum before this. (3) Figure 1: It would be informative to label which genomes were sequenced with PacBio versus just Illumina. (4) Lines 254 - 255: "As no other enrichment was identified we performed a manual annotation of genes identified in Figure 3D". I don't think it is clear here what manual annotation this is referring to. (5) One area where I felt the analysis was lacking was the lack of analyses on genome repeat content. The authors highlight the large variation in genome sizes within Colletotrichum species (~44 Mb vs ~90 Mb) and show in Figure 1 that this correlates with increased non-coding DNA. 
It would have been interesting to determine if this is driven by the proliferation of particular repeat families. (6) Another concern is the inconsistent use of genome annotation methods. 12 of the genomes reported in this study were annotated using the JGI annotation pipeline, whereas the other 6 were annotated using the MAKER pipeline. Several studies (e.g., Weisman et al., 2022 - Current Biology) show that inconsistent genome annotation methods can inflate the number of observed lineage specific genes. The authors may wish to comment on this or demonstrate that this isn't an issue in their study (e.g., by aligning lineage specific proteins against the other genome assemblies).

    1. respectively. Focusing on inversions and translocations, symmetric SVs which are readily genotyped within both populations, 24 were found to be structural divergences, 2,623 structural polymorphisms, and 928 shared structural polymorphisms. We assessed the functional significance of fixed interspecies SVs by examining differences in estimated recombination rates and genetic differentiation between species, revealing a complex history of natural selection. Shared structural polymorphisms displayed enrichment of potentially adaptive genes.

      Reviewer 2: Lejun Ouyang

      Structural variation plays an important role in the domestication and adaptability of species. The authors compared the structural variation between E. melliodora and E. sideroxylon populations. This is a very interesting study, but it feels as though the authors mainly present statistics, and the biological questions raised by these differences have not been distilled, such as the impact of structural variation on recombination. What effect does it have on the differentiation of the two populations? Is it promoting or inhibiting? Secondly, the writing is not very clear, and some of the results are described too simply, resulting in unclear conclusions. When formatting pictures, try to avoid nesting pictures, and use A, B, C, etc. to represent them. Some of the more obvious issues are listed above; other minor issues follow:
      1. Lines 62-64: References are required.
      2. Lines 145-150: It is recommended to put it in the materials and methods section.
      3. The Synteny and structural variation annotation section requires a detailed explanation of the results in Figure 2 and Table 2.
      4. It is recommended to make Table 2 into a picture, the effect will be better.
      5. Tables should use a three-line format.
      6. Why does the recombination rate in Table 3 have positive and negative errors at the genome level, but only negative errors at the chromosome average level?
      7. Lines 219-220: It is recommended that methods not appear in the results section. It is recommended to put it in the methods section.
      8. The Structural variation genotyping in the results section needs to be modified.
      9. Figure 6 is a bit confusing. It is recommended to revise it to make it clearer.
      10. The results section of Figure 7 is not clearly described and the notes are not clear. What do the different colors represent?
      11. Lines 263-264: It is recommended that methods should not appear in the results section, but can be placed in the materials and methods section.
      12. It is recommended that Figure 8 be divided into Figure 8A and Figure 8B. Try not to have pictures within pictures, which can easily lead to unclear references.
      13. Lines 276-281: It is recommended to put it in the method section.
      14. Lines 289-290: It is recommended to put it in the method section.
      15. Lines 307-308: E. melliodora and E. sideroxylon should be in italics.
      16. Lines 311-318, lines 320-321: It is recommended to put them in the method section.
      17. Lines 338-339: E. melliodora and E. sideroxylon should be in italics.
      18. Line 342: It is recommended to put it in the discussion.
      19. It is recommended to change Figure 9B, Figure 10B and Figure 11B to Figure 20. Line 561: Add references.

    2. Structural variants (SVs) play a significant role in speciation and adaptation in many species, yet few studies have explored the prevalence and impact of different categories of SVs. We conducted a comparative analysis of long-read assembled reference genomes of closely related Eucalyptus species to identify candidate SVs potentially influencing speciation and adaptation. Interspecies SVs can be either fixed differences, or polymorphic in one or both species. To describe SV patterns, we employed short-read whole-genome sequencing on over 600 individuals of E. melliodora and E. sideroxylon, along with recent high quality genome assemblies. We aligned reads and genotyped interspecies SVs predicted between species reference genomes. Our results revealed that 49,756 of 58,025 and 39,536 of 47,064 interspecies SVs could be typed with short reads, in E. melliodora and E. sideroxylon

      Reviewer 1: Jakob Butler Ferguson et al. have performed a thorough analysis of two species of Eucalyptus, quantifying the extent of structural variation between assembled genomes of the species and determining how prevalent those variations are across a selection of wild material. I believe this study is of sufficient quality for publication in GigaScience, if some minor inconsistencies and grammatical issues are addressed, and a few supporting analyses are performed.
      The major changes I would like to see include the addition of a syri plot of the complete set of SVs between E. melliodora and E. sideroxylon. I believe this, along with correcting the scale on the plots of recombination in Figure S6/S7, would allow for a better comparison of how recombination rate is interacting with the SVs. I would also suggest a more formal test of enrichment for COG terms, to better support the statements of "enrichment" in the discussion.
      Suggested changes by line:
      Line 142 - This section is quite short, I would either merge this section into the Genome scaffolding (and annotation) section, or expand on the results of the gene annotation.
      Line 182 - (Supplementary Figure S4)
      Line 183 (and throughout) - Please be consistent with your references to tables and figures.
      Line 186 - Delete comma after 28.63%
      Line 194 - These are density plots rather than histograms
      Figure 4 - Both axes are labelled as PC1
      Line 217 (page 10, line numbers are doubled up) - This seems repetitive, perhaps "…especially as they may also represent divergent sequences".
      Line 221 (page 11) - Please insert "and" before polymorphic translocations
      Line 223 - You have stated earlier in the paragraph that those not successfully genotyped in both species are private or artefacts, please reduce the repetition.
      Figure 6 - I don't find this figure particularly informative (and somewhat confusing to interpret). I think showing the percentages of each different SV in a visual form implies a level of equivalence in genomic impact, which is difficult to reconcile with the raw difference in numbers. I think a supplemental table with the focus on the percentages would illustrate the point better.
      Line 246 - There is no mention in the methods about what r threshold was used to declare a pair "correlated", please state it here or in the methods.
      Line 265 - This line was confusing to interpret. A suggested alteration: "significant value. After attempting to functionally annotate all genes across the genome and placing them within COG categories, 247 of the total 281 gene candidates in SSPs were annotated. These genes were enriched for...."
      Line 266 - I would like to see a formal enrichment analysis rather than "increased/decreased association", so we could have a clearer picture of which gene functions are truly over/underrepresented in SSPs. You could subsequently limit Figure 8 to those that show a difference.
      Line 275 - The grammar of this title is a bit off, perhaps "Effect of syntenic, rearranged, unaligned regions and genes on recombination rates"
      Line 276 - This is the first mention of p, please define it as recombination rate
      Line 283 - The supplemental Figures S6 and S7 seem to have regions of heightened recombination, but this is difficult to interpret and compare with the current variable axis scales. Please make these consistent. I would also like to see the syri graph of the two aligned genomes, as this would allow for a visual comparison of SV regions with recombination rate.
      Line 290 - How were p-values adjusted?
      Line 294 - More information about this 'significantly' higher recombination rate would be good, either in the figure or further expanded in the text.
      Line 307 - Italics for species names (repeated in Figure 10 and Figure 11 caption)
      Line 310 - Similar problem to line 275
      Figure 10 - Having Figure 9b repeated in Figure 10 and Figure 11 is unnecessary.
      Line 336 - Vertical lines show average FST, not p
      Line 341 - Similar problem to line 275
      Line 356 - translocations should be plural
      Line 367 - Vertical lines show average SNP density, not p
      Line 391 - This is the first mention of barrier loci, please define
      Line 413 - As mentioned above, I would recommend a formal enrichment test to support this statement
      Line 428 - Grammar is poor here, please correct
      Line 490 - Please make this a complete sentence
      Line 499 - Please state how the Hi-C map was manually edited, and what informed the position of those edits.
      Line 508 - Please provide an example of how well your LAI score of ~18 compares. The LAI paper seems to intimate that 10 is low quality?
      Line 513 - Missing bracket for version number
      Line 536 - Syntenic rather than synteny
      Line 717 - Formatting error in references
      Supp table S3-S4-S5 - Space between E. and sideroxylon
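The formal enrichment test requested for Lines 266 and 413, together with the p-value adjustment asked about for Line 290, could be sketched as follows. This is only an illustration of one common choice (a one-sided hypergeometric test followed by Benjamini-Hochberg adjustment), not the method used in the manuscript. The function names and all counts other than the 247 annotated SSP genes quoted in the review are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Upper-tail probability P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: annotated genes genome-wide (background)
    K: background genes assigned to the COG category
    n: annotated candidate genes (e.g. genes in SSPs)
    k: candidate genes assigned to the COG category
    """
    total = comb(N, n)
    # comb() returns 0 when the lower draw count is impossible, so no extra bounds check is needed
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: 30,000 annotated background genes; 247 annotated SSP genes
# (from the review); per-category counts are invented for illustration only.
categories = {"COG_A": (25, 1500), "COG_B": (10, 1200), "COG_C": (3, 400)}
raw = [hypergeom_enrichment_p(k, 247, K, 30000) for k, K in categories.values()]
for name, p_adj in zip(categories, benjamini_hochberg(raw)):
    print(name, p_adj)
```

A test for under-representation would use the lower tail instead, and only categories passing the adjusted threshold would then need to appear in a figure such as Figure 8.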

    1. Python-based workflow for processing linked-read sequencing datasets, thereby filling the current gap in the field caused by platform-centric genome-specific linked-read data analysis tools.

      Reviewer 3: Dmitrii Meleshko The paper titled "LRTK: A Platform-Agnostic Toolkit for Linked-Read Analysis of Both Human Genomes and Metagenomes" by Yang et al. is dedicated to the development of a unified interface for linked-read data processing. The problem described in the paper indeed exists; each linked-read technology requires complex preprocessing steps that are not straightforward or efficient. The idea of consolidating multiple tools in one place, with some of them modified to handle multiple data types, is commendable. Overall, I am supportive of this paper. My main concern, however, is that the impact of linked-read applications in the paper appears to be exaggerated, and the authors need to provide more context in their presentation. Also, some parts of the paper are vaguely described. I will elaborate on my concerns in more detail below.
      x) "Linked-read sequencing generates reads with high base quality and extrapolative information on long-range DNA connectedness, which has led to significant advancements in human genome and metagenome research[1-3]." - Citations 1-3 do not really tell about advancements in human genome and metagenome research, these are technology papers. A similar problem can be found in the "Despite the limitations that genome specificity…" paragraph. The authors cited and described several algorithms that are not really genomic studies. E.g. "stLFR[2] has found application in a customized pipeline that has been developed to first convert its raw reads into a 10x-compatible format, after which Long Ranger is applied for downstream analysis." is not an example of a genomic study, but a pipeline description.
      x) Table S1 does not improve the paper, I would say it does completely the opposite. LongRanger is not a toolkit, it should be considered as a read alignment tool that outputs some SVs and haplotypes along the way. So the LongRanger vs LRTK comparison does not make sense to me. There are other tools that solve the metagenome assembly problem, the human assembly problem, call certain classes of SVs etc.
      x) I think incorporating LongRanger is important, since its performance is reported to be better than EMA for human samples and it is also more popular than EMA. Is it possible and have you tried doing it?
      x) I would remove exaggerations such as "myriad" from the text. The scope of linked-reads is pretty limited nowadays. I agree that linked-reads might be useful in metagenomics/transcriptomics and other scenarios that were mentioned in the text, but the number of studies is very limited especially nowadays, and was not really big when the 10X platform was on the rise.
      x) "LRTK reconstructs long DNA fragments" - when people talk about long fragment reconstruction, they usually mean moleculo-style reconstruction through assembly. This reconstruction resembles "barcode deconvolution", described in Danko et al. and Mak et al. So I would stick to this terminology.
      x) It is important to note that Aquila, LinkedSV and VALOR2 are linked-read specific tools, while FreeBayes, Samtools and GATK are short-read tools. Also, provide the target SV length for both groups of tools.
      x) There are some minor problems with the Github readme. E.g. "*parameters". Also, I don't understand how to use conversion in real life… E.g. 10X Genomics data often comes as a folder with multiple gzipped R1/R2/I1 files. I don't understand how I would use it in that case.
      x) Please cite or explain why this is happening (not only when) - "A known concern with stLFR linked-read sequencing is the loss of barcode specificity during analysis."
      x) I don't understand what "Length-weighted average (μFL) and unweighted average (WμFL) of DNA fragment lengths." from the figure means. One of them is just an average, and what about the second? The figure looks confusing.
      x) "LRTK supports reconstruction of long DNA fragments" - this section describes something else. It is more about statistics and data QC.
      x) "LRTK promotes metagenome assembly using barcode specificity" - please remove supernova, it was never a metagenomic assembler. Check cloudSPAdes instead.
      x) "The superior assembly performance we have observed" - superior compared to what? If so, some short-read benchmark should be included.
      x) "LRTK improves human genome variant phasing using long range information" - What dataset is this? What callset was used for ground truth? Briefly describe how comparisons were done?
      x) Figures 5F-G together are very confusing. First, I don't expect tools like LinkedSV to have high recall (around 1.0) and low precision. Also, figure G is kind of a subset of figure F, but the results are completely different. Also use explicit notation. E.g. 50-1kbp and 1-10kbp mean completely different things.
      x) "We curated one benchmarking dataset and two real datasets to demonstrate the performance of LRTK" - what do you mean by "curation" here?
      x) Why don't you use the Tell-Seq barcode whitelist mentioned here - https://sagescience.com/wpcontent/uploads/2020/10/TELL-Seq-Software-Roadmap-User-Guide-2.pdf
      x) The tiered alignment approach is vaguely introduced. It is not clear what "n% most closely covered windows." means, or how we select a subset of reference genomes for the second phase.

    2. benchmarking and three real linked-read data sets from both the human genome and metagenome. We showcase LRTK’s ability to generate comparative performance results from the preceding benchmark study and to report these results in publication-ready HTML document plots. LRTK provides comprehensive and flexible modules along with an easy-to-use

      Reviewer 2: Lauren Mak Summary: This manuscript describes the need for a generalized linked-read (LR) analysis package and showcases the package the authors developed to address this need. Overall, the workflow is well-designed but there are major gaps in the benchmarking, analysis, and documentation process that need to be addressed before publication.
      Documentation:
      The purpose of multiple tool options: While the analysis package is technically sound, one major aspect is left unexplained: why are there so many algorithm options included without guidance as to which one to use? There are clearly performance differences by different algorithms (combinations of 2+ not considered either) on different types of LR sequence.
      Provenance of ATCC-MSA-1003: Nowhere in the manuscript is the biological and technical composition of the metagenomics control described. It would be helpful to mention that this is specifically a mock gut microbiome sample, as well as the relative abundances of the originating species and the absolute amounts of genetic material per species (ex. as measured by genomic coverage) in the actual dataset. As a corollary, there should be standard deviations in any figures that display a summary statistic (ex. Figure 3A: precision, recall, etc.) that seems to be averaged across the species in a sample. This includes Figure 3A and Figure 4A.
      Dataset details: There is no table indicating the number of reads for each dataset, which would be helpful in interpreting Figures 3 and 4.
      Open source?: There was no Github link provided, only a link to the Conda landing page. Are there thorough instructions provided for the package's installation, input, output, and environment management?
      Benchmarking:
      The lack of simulated tests: The above concern (expected performance on idealized datasets) is best addressed with simulated data, which was not done despite the fact that LRSim exists (and apparently the authors have written a tool for stLFR as well previously).
      Indels: What are the sizes of the indels detected? Why were newer tools, such as PopIns2, Pamir, or Novel-X not tried as well?
      Analysis:
      Lines 166-169: Figure 1 panel A1 vs. B1: why do the distributions of estimated fragment sizes from the 10x datasets look so different in metagenomic vs. human samples, when there is reasonable consistency in TELL-Seq and stLFR datasets?
      Lines 182-184: Figure 3A: why is LRTK's taxonomic classification quality generally lower than that of the other tools? At least in terms of recall, it should perform better, as mapping reads to reference genomes should have a lower false negative rate than k-mer-based tools. Also, what is the threshold for detecting a taxon? Is it just any number of reads or is there a minimum bound?
      Lines 187-188: Figure 3B: at least 15% of each caller's set of variants is unique to that caller, while a maximum of 50% is universal. I'd not interpret that as consistency.
      Lines 192-193: Are you referring to allelic imbalance as it is popularly used to refer to expression variation between the two haplotypes of a diploid organism? This clearly doesn't apply in the case of bacteria. If this is not what you're referring to, please define and/or cite the applicable definition.
      Lines 201-208: It's odd that despite the 10x datasets having the largest estimated fragment size, they have some of the smallest genome fractions, NGA50, and NA50. Why is this? Are they just smaller datasets, on average?
      Miscellaneous:
      UHGG: Please mention the fact that the UHGG is the default database, as well as whether or not the user will be able to supply their own databases.
      Line 363: What does {M} refer to?
      Line 369: What does U mean here? Is this the number of uniquely aligned reads in one of the windows N that a multi-aligned read aligns to?
      Lines 371-372: What does 'n% most closely covered windows' refer to?
      Lines 399-405: How are SNVs chosen for MAI analysis from the three available SNV callers?
      Lines 653-656: Which dataset was used for quality evaluation?
      Line 665: What do the abbreviations BAF and T stand for?

    3. Linked-read sequencing technologies generate high base quality reads that contain extrapolative information on long-range DNA connectedness. These advantages of linked-read technologies are well known and have been demonstrated in many human genomic and metagenomic studies. However, existing linked-read analysis pipelines (e.g., Long Ranger) were primarily developed to process sequencing data from the human genome and are not suited for analyzing metagenomic sequencing data. Moreover, linked-read analysis pipelines are typically limited to one specific sequencing platform. To address these limitations, we present the Linked-Read ToolKit (LRTK), a unified and versatile toolkit for platform-agnostic processing of linked-read sequencing data from both human genomes and metagenomes. LRTK provides functions to perform linked-read simulation, barcode error correction, read cloud assembly, barcode-aware read alignment, reconstruction of long DNA fragments, taxonomic classification and quantification, as well as barcode-assisted genomic variant calling and phasing. LRTK has the ability to process multiple samples automatically, and provides the user with the option to generate reproducible reports during processing of raw sequencing data and at multiple checkpoints throughout downstream analysis. We applied LRTK on two

      Reviewer 1: Brock Peters Yang et al. describe a package of tools, LRTK, for cobarcoded reads (linked reads) agnostic of library preparation methods and sequencing platforms. In general, it appears to be a very useful tool. I have a few concerns with the manuscript as it is currently written:
      1. Line 203 "With Pangaea, LRTK achieves NA50 values of 1.8 Mb and 1.2 Mb for stLFR and TELL-Seq sequencing data, respectively. On 10x Genomics sequencing data, Athena exhibited superior assembly performance, with a NGA50 of 245 Kb." These two sentences are a bit awkward, as you are comparing NA50 values for stLFR and TELL-Seq and then NGA50 for 10X Genomics, and it makes it sound like 10X Genomics performed the best. Also, these numbers don't seem to agree with the figure.
      2. How long does an average run take to process? Say a 35X human genome coverage sample? Are there requirements for memory? A figure and metrics around this sort of thing would be helpful.
      3. How much data was used per library? What was the total coverage? Was the data normalized to have the same coverage per library? If not, it's very difficult to make fair comparisons between the different technologies.
      4. There's a section on reconstruction of long fragments, but then there really isn't any evaluation of this result and it's not clear if these are even used for anything. For all of these sequencing types I would assume that you can't really do much in the way of seed extension since the coverage across long fragments for these methods is much less than 1X. I think this needs to be developed a little more, or it needs to be explained how these are used in your process, or you just need to say you didn't use them for anything but here are some potential applications they could be used for. What type of file is output from this process? I think it's interesting, but it is just not clear how to use this data.
      5. I did try to install the software using Conda, but it failed and it's not clear to me why. Perhaps it's something about my environment, but you might want to have some colleagues located in different institutions try to install it to make sure it is easy to do so.
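Point 1 above turns on the difference between NA50 and NGA50. As commonly defined in assembly evaluation, both are N50-style statistics computed on aligned blocks (contigs split at misassemblies); NA50 uses the total assembly length as the denominator, while NGA50 uses the reference genome length, so the two are not directly comparable. A minimal sketch of that difference, using a hypothetical helper `nx50` and invented block lengths:

```python
def nx50(lengths, total=None):
    """Length L such that blocks of length >= L cover at least half of `total`.

    With total=None the sum of `lengths` is used (N50 / NA50 style);
    passing the reference genome size gives the NG50 / NGA50 style value.
    Returns 0 if the blocks cover less than half of `total`.
    """
    target = (sum(lengths) if total is None else total) / 2
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0


# Hypothetical aligned block lengths (kb) and a hypothetical 400 kb reference
blocks = [80, 70, 50, 40, 30, 20]
print(nx50(blocks))             # assembly-length denominator
print(nx50(blocks, total=400))  # reference-length denominator
```

Because the reference-based value can only be equal to or lower than the assembly-based one when the assembly is shorter than the reference, reporting the same statistic for all three platforms would make the comparison in the quoted sentences easier to follow.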

    1. Results The cupuassu genome spans 423 Mb, encodes 31,381 genes distributed in the ten chromosomes, and it exhibits approximately 65% gene synteny with the T. cacao genome, reflecting a conserved evolutionary history, albeit punctuated with unique genomic variations. The main changes are pronounced by bursts of long-terminal repeats retrotransposons expansion at post-species divergence, retrocopied and singleton genes, and gene families displaying distinctive patterns of expansion and contraction. Furthermore, positively selected genes are evident, particularly among retained and dispersed, tandem and proximal duplicated genes associated to general fruit and seed traits and defense mechanisms, supporting the hypothesis of potential episodes of subfunctionalization and neofunctionalization following duplication, and impact from distinct domestication process. These genomic variations may underpin the differences observed in fruit and seed morphology, ripening, and disease resistance between cupuassu and the other Malvaceae species.

      Reviewer 2: Jian-Feng Mao Rafael et al. contributed their study, "Genomic decoding of Theobroma grandiflorum (cupuassu) at chromosomal scale: Evolutionary insights for horticultural innovation". In this study, a high-quality genome assembly for an important plant was generated and the authors further investigated genome characterization, genome evolution, gene families etc. The data quality is high, though some points need to be clarified. The reported data and investigations could provide valuable inference for following studies. This paper is generally well-prepared.
      Major comments:
      1. Quality control of genome assembly. The quality of the genome assembly could be better evaluated with more stringent parameters. On assembly quality control, I recommend always following the criteria established by the Earth Biogenome Project (Report on Assembly Standards, https://www.earthbiogenome.org/assembly-standards). Please evaluate the present assemblies with the criteria from the EBP project on at least some, if not all, of the items. At least, I think Merqury results would be very informative. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02134-9
      2. Gaps in each pseudo-chromosome. It is not clear whether gaps still remain, or whether the genome is gap-free.
      3. Centromere region. How were centromeres identified? Centromeres were shown, but there is no description of how you identified them. Given the high quality of the genome assembly, it would be very interesting to incorporate an investigation into the distribution of centromeres. A pipeline, Centromics (https://github.com/ShuaiNIEgithub/Centromics), which identifies centromeres with multi-omics data such as repeat profiling and Hi-C chromatin contact, may be helpful. It was generally described at https://academic.oup.com/hr/article/10/1/uhac241/6775201?login=true and has already been widely applied in data analyses in some recently published T2T assemblies.

    2. Background Theobroma grandiflorum (Malvaceae), known as cupuassu, is a tree indigenous to the Amazon Basin, valued for its large fruits and seed-pulp, contributing notably to the Amazonian bioeconomy. The seed-pulp is utilized in desserts and beverages, and its seed butter is used in cosmetics. Here, we present the sequenced telomere-to-telomere cupuassu genome, disclosing features of the genomic structure, evolution, and phylogenetic relationships within the Malvaceae.

      Reviewer 1: Xupo Ding
      1. Line or page numbers should be added to the revised manuscript; without them it is hard to point a comment to a definite line.
      2. The methods and parameters of the TE analysis should be detailed in the main text or a supplementary file, especially for the LAI calculation. The LAI output by our pipeline is 11.47, and the pipeline was built according to the default parameters of LTR_retriever (https://github.com/oushujun/LTR_retriever).
      3. What was the mutation rate (r) used for the TE insertion time calculation? If the insertion times were taken from the original output files of EDTA, please note that the default r is 1.3e-8, a value for the grass family, when --u is not set when running EDTA; the times should be converted with the correct r value.
      4. Generally, the Gypsy content is higher than the Copia content in plant genomes, please check it. If it is correct, please suggest a reason.
      5. All GO enrichment results would be better complemented with KEGG enrichment.
      6. The enrichment results were written hastily; many GO functions or GO numbers are merely listed, and more detail is needed. Cite the figures, tables or references in these sections.
      7. In Figure 1C, the Ks distribution needs correction; the authors can refer to the polyploidization of the durian genome published in Plant Physiology in 2019.
      8. In Figure 2C, why do some TE orders lack the SD?
      9. In Figure 3A, T. grandiflorum and T. cacao are highly syntenic at the gene level; the software Liftoff might detect extra genes in the T. grandiflorum genome based on the T. cacao genome. This is just a suggestion.
      10. In Figure 5A, there were 282 specific genes in T. grandiflorum, please perform GO and KEGG enrichment on them.
      11. Figure 5B and D are from the GO enrichment; the GO numbers should be added next to the annotations or listed in the supplementary files.
      12. In Figure 5C, the confidence interval of the divergence time should be added.
      13. In the data availability section, the web link is not accessible to everyone. GigaDB will record your data, so the inaccessible web link might not be necessary.
      14. In the MS, disease resistance is mentioned repeatedly, and the GO enrichment has provided some evidence. It would be better to perform the KEGG analysis with the specific genes and the expanded or contracted genes to verify this, especially to quantify the changes in ko04626.
      15. The language must be improved and revised by a native academic English speaker.
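The conversion requested in point 3 can be sketched with the usual LTR dating relation T = K / (2r), where K is the sequence divergence between the two LTRs of an element and r is the substitution rate per site per year. Since T is inversely proportional to r, times computed with one rate can be rescaled to another. The snippet below is a minimal illustration; the function names and the replacement rate of 6.5e-9 are hypothetical placeholders, not values from the manuscript.

```python
def insertion_time(divergence, rate):
    """Insertion time in years from LTR divergence K and rate r: T = K / (2r)."""
    return divergence / (2 * rate)


def rescale_time(time_old, rate_old, rate_new):
    """Convert a time computed with rate_old into the time implied by rate_new."""
    return time_old * rate_old / rate_new


DEFAULT_RATE = 1.3e-8   # default r mentioned in the review
SPECIES_RATE = 6.5e-9   # hypothetical species-appropriate rate, for illustration only

t_default = insertion_time(0.026, DEFAULT_RATE)              # K = 0.026 is an invented example
t_corrected = rescale_time(t_default, DEFAULT_RATE, SPECIES_RATE)
print(t_default, t_corrected)
```

Halving the rate doubles every estimated insertion time, so the choice of r directly shifts any inferred burst of LTR retrotransposon expansion relative to the species divergence.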

    1. Results Here we introduce Hecatomb, a bioinformatics platform enabling both read and contig based analysis. Hecatomb integrates query information from both amino acid and nucleotide reference sequence databases. Hecatomb integrates data collected throughout the workflow enabling analyst driven virome analysis and discovery. Hecatomb is available on GitHub at https://github.com/shandley/hecatomb.

      Reviewer 2: Satoshi Hiraoka In this manuscript, the authors developed a novel pipeline, Hecatomb, for viral genome analysis using metagenome and virome data that accepts both short- and long-read sequencing data. Using the pipeline, the authors performed the analysis using one virome and one metagenome dataset from different environments (stool and coral reef, respectively). The analyses showed reasonable results according to the original studies and, moreover, they discovered candidate novel phages and new findings that possibly offer great insight into the microbial ecology. The manuscript is overall informative and well-written. Hecatomb incorporates famous bioinformatics tools that are frequently used in viral genome analyses today, allowing many researchers including beginners to examine virome datasets easily and effectively. Thus the pipeline is likely valuable and would contribute to wide studies of viruses, most of which are not cultured and whose characteristics are unknown. Noteworthy, there is an informative document page ( https://hecatomb.readthedocs.io/en/latest/ ) including tutorials, which are very helpful for many users. I think this point could be more emphasized in the manuscript. However, unfortunately, the lack of an analysis of a mock dataset makes it hard to estimate the accuracy of the pipeline. I think adding such analysis for evaluating the performance would greatly improve the study.
      I have some suggestions that would increase the clarity and impact of this manuscript if addressed.
      Major:
      In general, to clearly evaluate the efficiency of novel bioinformatic tools and pipelines, benchmarking using ground-truth datasets is important before the application to real datasets. To achieve this, in this case, some artificial datasets that are composed of known viral and prokaryotic genomes with defined composition, library types (single and paired-end) and sequenced read lengths (current short- and long-reads) could be designed as mock metagenome data. Via the analysis using the mock datasets, the accuracy of the pipeline can be evaluated. It would be appreciated if the authors performed such benchmarking tests as well as the real data applications.
      According to the GitHub page, Hecatomb is designed to generate results that reduce false-positive and enrich for true-positive viral read detection. This point is important for understanding the purpose of developing the pipeline and differentiating the pipeline from other ones. The efficiency of the false-positive reduction using this pipeline should be shown more clearly in this manuscript. Therefore the mock dataset analyses are expected.
      When I read the manuscript, I was confused about what dataset the pipeline is aimed at. Is Hecatomb designed to analyze common prokaryotic shotgun metagenomic data to detect viruses? In other words, is the pipeline not limited to analyzing viral metagenomes (viromes), in which viral particles from the samples are specifically enriched for sequencing (e.g., density centrifugation to condense viral particles)? The stool samples were likely virome datasets (viral particles were enriched via 0.45-μm-pore-size membrane filtration according to the article), whereas the coral reef data are metagenome datasets. I would suggest that the terms "viral metagenome" (or virome, specifically targeting only viruses) and common "metagenome" (mainly focusing on prokaryotes) should be clearly distinguished throughout the manuscript including the title.
      I'm wondering about the sequence clustering step in Module 1. In my understanding, in metagenomic settings, genomic regions are randomly sequenced, and thus most of the sequenced reads will not be clustered together using the criteria described in the manuscript, and not so many sequences are removed in this step. Is this step truly needed? Please add more explanation of this step and its importance. For example, what proportion of the reads was removed in this step in the test of the two real datasets (stool and coral reef)?
      Minor:
      The introduction section is informative but a bit long. The section could be shortened.
      Some viruses were newly found using the pipeline (e.g., Fig1A). Which one belongs to which virus type (dsDNA, ssDNA, dsRNA, ssRNA)? This information would be better shown clearly in the figure.
      I think the sequences derived from RNA viruses are generally not abundantly included in typical metagenomics datasets unless specific techniques are used in the experiment. I think the potential for detecting RNA viruses from typical metagenomic DNA sequencing reads should be discussed in the Introduction section.
      L103. Please describe where the name "Hecatomb" is derived from in this article, though this is shown on the GitHub page.
      L119. "round A/B libraries" appears here, but I have not heard of or could not find this term in the articles cited here. Please add more explanation of what "round A/B libraries" are.
      L130 up to 2 insertions and deletions?
      L131. BBmap included in BBtools [73]?
      L181. A brief explanation of the "Baltimore classification" here would improve the readability for readers who are not familiar with this.
      L239. There is no explanation of what "SIV" means before.
      L253-L268 & Figure 4B. According to Figure 2A, there are two paths (1,2,5: aa and 1,3,4,5: nt) for detecting viral reads. I'm interested in which path is major and which is minor. Could the authors provide the ratio of the reads that were predicted using aa or nt in each dataset examination (each stool and coral)?
      L431, L436. Not only BioProject but SRA accession ID should be provided.
      L479. There is no LACC here. What is his main contribution? Just reviewing and editing the manuscript is insufficient for citing as an author: see https://www.icmje.org/recommendations/browse/roles-and-responsibilities/defining-the-role-ofauthors-and-contributors.html#two
      Figure 1. There are some DBs newly created and used in the pipeline (e.g., Viral AA DB, Multi-kingdom AA DB, Virus NT DB, and Polymicrobial MT DB). I think it would be better to add how to make the DBs in this or other figures. This would contribute to understanding how to construct the DBs and why to use them in this pipeline.
      Figure 1. Specify (1)-(4) in the legend, not just color.
      Figure 4A. Please provide the total number of sequencing reads in addition to the read count assigned to each virus.
      Figure 4C. CPM was not explained in the manuscript and not listed in L460.
      L490. Some references are incomplete, e.g., lack of article ID or page number (49, 79, 90, 94, 95, 96, 100, 101, 102), remaining unnecessary words ("academic.oup.com" in 90, 91), etc. Please check the reference list carefully.
      Figure S5. Alignment length (bp)
      Table S2. For calculating the best hit identity, what database was used?

    2. Background Analysis of viral diversity using modern sequencing technologies offers extraordinary opportunities for discovery. However, these analyses present a number of bioinformatic challenges due to viral genetic diversity and virome complexity. Due to the lack of conserved marker sequences, metagenomic detection of viral sequences requires a non-targeted, random (shotgun) approach. Annotation and enumeration of viral sequences relies on rigorous quality control and effective search strategies against appropriate reference databases. Virome analysis also benefits from the analysis of both individual metagenomic sequences as well as assembled contigs. Combined, virome analysis results in large amounts of data requiring sophisticated visualization and statistical tools.

      Reviewer 1: Arvind Varsani The MS titled "Hecatomb: An Integrated Software Platform for Viral Metagenomics" addresses the development of a toolkit for viral metagenomics analysis that assembles a variety of tools into a workflow. Overall, I do not have any issue with this MS or the toolkit. I have some minor points to help improve the MS and make it as current as possible.
      1. Line 40: I would include Cenote-Taker 2 PMID: 33505708, geNomad https://www.biorxiv.org/content/10.1101/2023.03.05.531206v1
      2. Line 40: I would probably not cite the preprint of this current paper - see ref 21.
      3. Line 80: Actually Cenote-Taker (both version 1 and 2) uses HMMs and as far as I know so does geNomad.
      4. Line 248: Please note that Siphoviridae, Podoviridae and Myoviridae are not currently family names. See PMID: 36683075
      5. This means you will likely need to edit your figure to collapse these to Caudovirales
      6. Line 250-251: Picornaviridae and Adenoviridae should be in italics
      7. Line 270: Here and elsewhere, please note that a taxon does not infect a host, it is a virus that infects a host. "Mimiviridae, that infect Acanthamoeba, and Phycodnaviridae, that infect algae, are both dsDNA viruses with large genomes" should ideally be written as "Viruses in the family Mimiviridae infect Acanthamoeba and those in the family Phycodnaviridae infect algae; both are dsDNA viruses with large genomes."
      8. Figure 6: the name tags of the CDS/ORFs are truncated e.g. replication initiate…, heat maturation prot…
      9. Figure 6: Major head protein should be major capsid protein.
      10. One thing that I would highlight is that none of the workflows / tool kits developed account for spliced CDS. This is a major issue in automation of virus genome annotation at the moment and with this there will be some degree of misidentification for taxa assignment.

    1. Findings The similarity between samples can be determined using indexed k-mer sequence variants. To increase statistical power, we use coverage information on variant sites, calculating similarity using a likelihood ratio-based test. Per sample error rate, and coverage bias (i.e. missing sites) can also be estimated with this information, which can be used to determine if a spatially indexed PCA-based pre-screening method can be used, which can greatly speed up analysis by preventing exhaustive all-to-all comparisons.

      Reviewer2: Qian Zhou

      In this paper, the authors have presented a tool, ntsm, which utilizes the k-mer distribution information directly from raw sequencing data for sample swap detection. The approach of bypassing the reference genome alignment step and saving computational resources is commendable. Utilizing k-mers for reference-free and de novo analysis of sequencing data is a valuable application. The authors have demonstrated the impressive performance of ntsm on low coverage data through experimental results presented in the manuscript, showcasing its strengths in terms of sensitivity and accuracy. However, while ntsm eliminates the need for reference genome alignment, it still relies on a pre-defined set of variant sites and pre-built PCA rotation matrices. This raises doubts about the true reference-free nature of ntsm and concerns about its generalizability to other species.

      Major comments:

      1. The concept of reference-free: I believe that ntsm's approach is not truly reference-free. To use ntsm, one needs existing high-quality population SNP sites and k-mers from the human reference genome. Additionally, the population PCA results are used to assist in pairwise comparisons between samples. Both kinds of information can only be obtained when a reference genome is available. A truly reference-free tool would be applicable to species without a reference genome, such as SPLASH (Chaung et al., 2023, Cell). ntsm can be considered an alignment-free or k-mer-based tool.

      2. The reduction of computational costs: ntsm differs from Somalier in its computational workflow. To compare the computational costs or time, a holistic end-to-end comparison is necessary, rather than timing individual steps such as k-mer counting and sample pairwise comparison separately. Conducting an end-to-end comparison for an analysis task allows users to have a comprehensive understanding of the tool's time and cost consumption. Furthermore, when comparing software, it is important to allocate computational resources fairly. For example, ntsm utilizes 16 threads in the 'Sample comparison process' stage, while for the 'k-mer counting (ntsm) vs. alignment (somalier)' stage, tools like bwa and minimap2, which can utilize multiple threads, were run using a single thread.

      3. Sensitivity and Specificity: More experimental details are needed. In the section 'Sensitivity and Specificity of Sample Swaps,' were the results obtained using the 39 HPRC samples? Did it include their Hi-C data? For Fig 6, did the results come from all sequencing datasets of the 39 samples, including Illumina and ONT? Since the results were obtained using full coverage, would the threshold change at lower coverage? For Fig 7, which demonstrates ntsm's results, was PCA information used as an auxiliary? Does the use of PCA information impact sensitivity and specificity?

      4. Regarding the PCA-based method: The 39 HPRC samples used in the study are actually part of the 3,202 samples from the 1000 Genomes Project. Therefore, it is important to clarify whether the PCA matrix used in the study already includes information from these 39 samples. From a rigorous experimental design perspective, a precomputed PCA matrix should not include information from the 39 samples. Otherwise, the effect of the PCA matrix on these 39 samples may be overestimated. It raises questions about whether the same results can be achieved on non-1000 Genomes Project samples.

      5. The applicability of the tool: In order to expand the applicability of ntsm to a wider range of species, two aspects need to be addressed: 1) Provide detailed information on customizing the sites file. From the site files available in the ntsm code repository on GitHub, the process of selecting variant sites seems to be more complex than what is described in the manuscript, involving more than just SNP variants. 2) The sites and PCA files should be user-customizable inputs instead of being built-in. This limitation restricts the application of ntsm to other species.

      Minor comments:

      The manuscript appears to have been hastily written and requires further polish by the authors.

      1. In Figure 6, A and B seem to be labeled incorrectly.

      2. In Figure 9, the two subplots have different y-axes, one labeled "min" and the other labeled "s." Could you clarify what each subplot is illustrating?

      3. When mentioning HPRC for the first time, it would be helpful to provide the full name and explanation of the acronym. However, the full explanation appears in the next paragraph.

      4. "We then keep only purine to pyrimidine (A or T to G or C) variants, as final insurance against possible human error influencing this tool" It seems there may be a mistake or confusion in the sentence. The writer should indeed mention "A/G <-> C/T" instead of "A/T <-> G/C" to accurately describe purine to pyrimidine variants. The writer may have made an error in describing the nucleotide exchange, or it could be a typographical mistake.

      5. There is a typo in the formula for estimating sequencing error rate. (nm)·log(1-… …
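The base classes behind minor comment 4 can be illustrated with a short sketch. It is not part of ntsm; the function names are hypothetical, and the only facts assumed are that A and G are purines and C and T are pyrimidines. It shows that "A or T to G or C" mixes purine-to-purine, pyrimidine-to-pyrimidine, and purine/pyrimidine exchanges.

```python
# Illustrative only: classify single-nucleotide substitutions by base class.
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def base_class(base):
    """Return 'purine' or 'pyrimidine' for a DNA base."""
    base = base.upper()
    if base in PURINES:
        return "purine"
    if base in PYRIMIDINES:
        return "pyrimidine"
    raise ValueError(f"not a DNA base: {base!r}")


def substitution_type(ref, alt):
    """Transition = same base class; transversion = purine <-> pyrimidine."""
    if ref.upper() == alt.upper():
        raise ValueError("ref and alt must differ")
    same_class = base_class(ref) == base_class(alt)
    return "transition" if same_class else "transversion"


# The four substitutions covered by "A or T to G or C"
for ref, alt in [("A", "G"), ("A", "C"), ("T", "G"), ("T", "C")]:
    print(f"{ref}>{alt}: {base_class(ref)} to {base_class(alt)} "
          f"({substitution_type(ref, alt)})")
```

Only A>C and T>G in that set cross the purine/pyrimidine boundary, which is the distinction the reviewer is asking the authors to state precisely.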

    2. Background Due to human error, sample swapping in large cohort studies with heterogeneous data types (e.g. mix of Oxford Nanopore, Pacific Bioscience, Illumina data, etc.) remains a common issue plaguing large-scale studies. At present, all sample swapping detection methods require costly and unnecessary (e.g. if data is only used for genome assembly) alignment, positional sorting, and indexing of the data in order to compare similarly. As studies include more samples and new sequencing data types, robust quality control tools will become increasingly important.

      Reviewer1: Jianxin Wang

      In this manuscript, the authors present a fast intra-species sample swap detection tool named ntsm. By counting the relevant variant k-mers from samples, it estimates the probability of each allele at sites and then uses the likelihood ratio test to detect sample swaps. Compared with the alignment-based method, Somalier, ntsm performs better on low coverage data (≤5X) and is more efficient in terms of memory and computing time. The authors use a PCA-based spatial index heuristic to reduce the number of sample comparisons. Of course, in my opinion, compared with the time spent on counting k-mers, the time saved by the PCA-based method is trivial. In addition, ntsm also provides other features such as error rate estimation. The tool requires population SNP information, which limits its applications in practice to some extent. Overall, ntsm is a fast and practical tool for calculating intra-species sample similarity and detecting sample swaps. The writing and experiments in this paper are generally well done. There are some major and minor issues that I suggest the authors consider addressing.

      Major issues:

      The paper mentions that due to high error rates, nanopore data is difficult to analyze. Can the authors analyze the performance of ntsm on data with different error rates? In general, alignment-based methods may perform better on high error rate data. This is very useful information for users choosing a tool.

      The authors use the PCA-based spatial index heuristic to reduce the number of pairwise comparisons. However, the relation between PCA distance and similarity score is not clear here. How can it be ensured that samples with similarity scores less than the threshold are within the search radius?

      The paper involves two metrics, namely similarity score and relatedness, to detect sample swaps. Can the authors analyze the relation between them to help readers understand the advantages and disadvantages of the two methods?

      Minor issues:

      In the "Conclusions" section, the second "useful" in the sentence "this method provides other useful information useful in QC" is redundant.

      "R=1, p<2.2e-16" in Figure 3 is not explained.

      In the "Sequencing error rate estimation" section, the variable n is not explained.

      In Figure 9, the case of the first letter of the two y-axis labels (time) is inconsistent.

  3. May 2024
    1. Editors Assessment:

      This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example presents the genome of the golden birdwing butterfly Troides aeacus (Lepidoptera, Papilionidae), a notable and popular species in Asia that faces habitat loss due to urbanization and human activities. The lack of genomic resources impedes conservation efforts based on genetic markers, as well as a better understanding of its biology. Using PacBio HiFi long reads and Omni-C data, a 351 Mb genome was assembled and anchored to 30 pseudo-molecules. After reviewers requested more information on the genome quality, the assembly appears to have high sequence continuity, with contig length N50 = 11.67 Mb and L50 = 14, and scaffold length N50 = 12.2 Mb and L50 = 13. A total of 24,946 protein-coding genes were predicted. This study presents the first chromosomal-level genome assembly of the golden birdwing T. aeacus, a potentially useful resource for further phylogenomic studies of birdwing butterfly species in terms of species diversification and conservation.

      This evaluation refers to version 1 of the preprint

    2. Abstract Troides aeacus, the golden birdwing (Lepidoptera, Papilionidae) is a large swallowtail butterfly widely distributed in Asia. Despite its occurrence, T. aeacus has been assigned as a major protective species in many places given the loss of their native habitats under urbanisation and anthropogenic activities. Nevertheless, the lack of its genomic resources hinders our understanding of their biology, diversity, as well as carrying out conservation measures based on genetic information or markers. Here, we report the first chromosomal-level genome assembly of T. aeacus using a combination of PacBio SMRT and Omni-C scaffolding technologies. The assembled genome (351 Mb) contains 98.94% of the sequences anchored to 30 pseudo-molecules. The genome assembly also has high sequence continuity with scaffold length N50 = 12.2 Mb. A total of 28,749 protein-coding genes were predicted, and high BUSCO score completeness (98.9% of BUSCO metazoa_odb10 genes) was also revealed. This high-quality genome offers a new and significant resource for understanding the swallowtail butterfly biology, as well as carrying out conservation measures of this ecologically important lepidopteran species.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.122), and has published the reviews under the same license. This is part of a thematic series presenting Data Releases from the Hong Kong Biodiversity Genomics consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). These are as follows.

      Reviewer 1. Dr. Kumar Saurabh Singh

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      No. 1. I've noticed that the genome assembly file has been uploaded to NCBI, but I couldn't locate the corresponding annotation files in GFF format. Additionally, I couldn't find gene models for Troides aeacus on NCBI or any other platform. As per Giga Science data policy, these files should be made publicly available. 2. The paper lacks information on the contig N50 and L50, although I did find this data on NCBI. Is there a specific reason for omitting the contig N50/L50 details from the main text or tables?
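For readers unfamiliar with the contiguity statistics requested here, a minimal sketch of how N50 and L50 are conventionally computed from a list of contig (or scaffold) lengths follows. The function name and the toy lengths are hypothetical and are not taken from the T. aeacus assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the assembly;
    L50 is the number of sequences in that set.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy lengths (hypothetical, arbitrary units)
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(toy_lengths))
```

Running the same function once on contig lengths and once on scaffold lengths gives the two pairs of values a reader would expect to see side by side in a summary table.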

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes. 1. I have noticed that the QV value is missing for the given assembly. To assess the base-level accuracy of the assembly, the authors should calculate the consensus quality (QV) by comparing the frequency of k-mers present in the raw Omni-C reads (as you only have short reads from Omni-C) with those present across the final assembly, perhaps using Merqury. 2. Incorporating Omni-C data did not result in a significant increase in the contig N50. Have you identified any specific reasons for this outcome? 3. The overall BUSCO completeness for proteins appears to be disproportionately low (~86%) compared to genomic completeness (~98%). Could this be attributed to the absence of RNAseq data for predicting accurate gene models?
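As a rough illustration of the k-mer-based consensus quality mentioned in point 1, the sketch below follows the k-mer survival model commonly described for Merqury: k-mers found in the assembly but absent from the reads are treated as errors. The function name and the example counts are hypothetical; in practice the counts would come from the tool's own k-mer databases.

```python
import math


def kmer_based_qv(asm_only_kmers, total_asm_kmers, k):
    """Estimate a Phred-scaled consensus quality (QV) from k-mer counts.

    asm_only_kmers: k-mers present in the assembly but not in the reads
    total_asm_kmers: all k-mers in the assembly
    k: k-mer size
    """
    if total_asm_kmers <= 0 or not 0 <= asm_only_kmers <= total_asm_kmers:
        raise ValueError("need 0 <= asm_only_kmers <= total_asm_kmers, total > 0")
    # Probability that a whole k-mer is error-free
    shared_fraction = 1 - asm_only_kmers / total_asm_kmers
    # Per-base error rate, assuming independent errors across the k bases
    error_rate = 1 - shared_fraction ** (1 / k)
    if error_rate == 0:
        return math.inf
    return -10 * math.log10(error_rate)


# Hypothetical counts, for illustration only
print(round(kmer_based_qv(asm_only_kmers=2_000, total_asm_kmers=350_000_000, k=21), 1))
```

A higher QV means fewer assembly-only k-mers; the estimate depends on read coverage, so low-coverage read sets can understate the true quality.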

      Is there sufficient data validation and statistical analyses of data quality?

      I believe it's essential to assess the assembly quality through comparative genomic analyses, a component seemingly missing from the manuscript. While the text mentions the availability of genomic resources within the same genus, conducting a genome-wide comparison of these assemblies could provide valuable insights into the overall synteny and contiguity of the T. aeacus assembly. To ensure annotation consistency, it's important to compare genome assemblies by generating distributions of intron/exon lengths for annotations across multiple assemblies.

      Reviewer 2. Dr.Xueyan Li

      Link to review: https://gigabyte-review.rivervalleytechnologies.comdownload-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvNDk1L0dpZ2FieXRlRFJSLTIwMjQwMS0wMS1jb21tZW50cy5kb2N4

      Re-review: The paper has been substantially enhanced after the first revision. I suggest that this manuscript can be published after the following minor revisions: 1. L279: ‘formosanus’ is also part of the scientific name and should be in italics. 2. It’s recommended to beautify the figures and tables.

    1. Editors Assessment:

      This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example assembles the genome of the common chiton, Liolophura japonica (Lischke, 1873). Chitons are marine molluscs found worldwide, from cold waters to the tropics, that play important ecological roles in the environment, but to date only a few chiton genome assemblies are available. The data were produced using PacBio HiFi reads and Omni-C sequencing, and the resulting genome assembly is around 609 Mb in size. From this, 28,010 protein-coding genes were predicted. After review improved the methodological details, the quality metrics look near chromosome-level, with a scaffold N50 length of 37.34 Mb and a 96.1% BUSCO score. This high-quality genome should hopefully be a valuable resource for gaining new insights into the environmental adaptations of L. japonica to life in the intertidal zone and for future investigations of the evolutionary biology of polyplacophorans and other molluscs.

      This evaluation refers to version 1 of the preprint

    2. Abstract Chitons (Polyplacophora) are marine molluscs that can be found worldwide from cold waters to the tropics, and play important ecological roles in the environment. Nevertheless, there remain only two chiton genomes sequenced to date. The chiton Liolophura japonica (Lischke, 1873) is one of the most abundant polyplacophorans found throughout East Asia. Our PacBio HiFi reads and Omni-C sequencing data resulted in a high-quality near chromosome-level genome assembly of ∼609 Mb with a scaffold N50 length of 37.34 Mb (96.1% BUSCO). A total of 28,233 genes were predicted, including 28,010 protein-coding genes. The repeat content (27.89%) was similar to the other Chitonidae species and approximately three times lower than in the genome of the Hanleyidae chiton. The genomic resources provided in this work will help to expand our understanding of the evolution of molluscs and the ecological adaptation of chitons.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.123), and has published the reviews under the same license. This is part of a thematic series presenting Data Releases from the Hong Kong Biodiversity Genomics consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). These are as follows.

      Reviewer 1. Jin Sun

      Are all data available and do they match the descriptions in the paper?

      Yes. The assembly and annotations can be found in the Figshare.

      Is the validation suitable for this type of data?

      Yes. I have examined the HiC interaction map, and I think the scaffolding is high-quality.

      Additional Comments:

      The presentation is clear, but I would suggest the authors include the latest BUSCO score for the gene models.

      Reviewer 2. Priscila M Salloum

      Is the language of sufficient quality?

      Yes. The language is appropriate and does not hinder understanding, but some minor proof reading could benefit the manuscript. I left a few suggestions in my comments to the authors.

      Are all data available and do they match the descriptions in the paper?

      No. The data made available on NCBI has the 632 scaffolds, but the 13 pseudomolecules are not shown (in GCA_032854445.1, under Chromosomes, it reads “This scaffold-level genome assembly includes 632 scaffolds and no assembled chromosomes”), please clarify where information/data for the 13 pseudomolecules can be found. The figshare repository has the annotation files, but it lacks a metadata file detailing what each of the annotation files is (the file names are descriptive, but they do not replace a metadata file). The data availability statement lacks information about the transcriptomes (were these made available?) Supplementary tables are mentioned in the text file but were not made available (at least not for review).

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      Yes. All that was provided was consistent.

      Is the data acquisition clear, complete and methodologically sound?

      No. Some clarification is needed (was the same sample used for the genome and transcriptome assembly? Were the different tissues processed in the same way? What software were used for all the bioinformatics steps? What were all the parameters and filters used for genome and transcriptome assembly and annotation?) I left specific suggestions in a file with additional comments to the authors.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. Software versions, citations, and parameters are missing from the methods section. Some results refer to methods not explained in the methods section.

      Is the validation suitable for this type of data?

      Yes. More details on the BlobTools parameters used are needed.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      No. Supplementary tables were mentioned but not provided (at least not for review). There is enough information for others to reuse the genome data, although more information in the methods section (as mentioned above) and a metadata file would make this even more useful. There is no mention of where the transcriptome has been deposited, and an extremely brief mention to how it was assembled (e.g., no details on parameters used or software versions).

      Additional Comments: Please include all citations in the reference list.

      And see additional file with comments: https://gigabyte-review.rivervalleytechnologies.com/journal/gx/download-files?YXJ0aWNsZV9pZD00OTYmZmlsZT0xOTgmdHlwZT1nZW5lcmljJnZpZXc9dHJ1ZQ==

    1. Editors Assessment:

      This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example assembles the genome of the long-spined sea urchin Diadema setosum (Leske, 1778). Using PacBio HiFi long reads and Omni-C data, the assembled genome size was 886 Mb, consistent with the size of other sea urchin genomes. The assembly was anchored to 22 pseudo-molecules/chromosomes, and a total of 27,478 genes, including 23,030 protein-coding genes, were annotated. Peer review added more to the conclusion and future perspectives. These data will hopefully provide a valuable resource and foundation for a better understanding of the ecology and evolution of sea urchins.

      This evaluation refers to version 1 of the preprint

    2. Abstract The long-spined sea urchin Diadema setosum is an algal and coral feeder widely distributed in the Indo-Pacific and can cause severe bioerosion on the reef community. Nevertheless, the lack of genomic information has hindered the study of its ecology and evolution. Here, we report the chromosomal-level genome (885.8 Mb) of the long-spined sea urchin D. setosum using a combination of PacBio long-read sequencing and Omni-C scaffolding technology. The assembled genome contained scaffold N50 length of 38.3 Mb, 98.1% of BUSCO (Geno, metazoa_odb10) genes, and with 98.6% of the sequences anchored to 22 pseudo-molecules/chromosomes. A total of 27,478 genes including 23,030 protein-coding genes were annotated. The high-quality genome of D. setosum presented here provides a significant resource for further understanding on the ecological and evolutionary studies of this coral reef associated sea urchin.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.121), and has published the reviews under the same license. This is part of a thematic series presenting Data Releases from the Hong Kong Biodiversity Genomics consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). These are as follows.

      Reviewer 1. Phillip Davidson

      Is the language of sufficient quality?

      Yes. Minor language errors that should be corrected in copy-editing

      Additional Comments:

      In their work, Hui et al present a chromosome-level genome assembly for Diadema setosum, the long-spined urchin. This new data is especially exciting given no high-quality genomic resource for the Diadematoida is available, bolstering comparative genomics work of echinoderms and the study of this species. Overall, the methods and data are well described and have produced a high quality genome assembly and associated annotations that will be a valuable addition to the community. I have a handful of primarily minor suggestions detailed below:

      Major comments:

      1. Conclusions and future perspectives: Currently, this section is only a sentence and states the new assembly will “further understanding of ecology and evolution of sea urchins”, which I think is a little uninspiring. I think more detail can be provided in this section to explain how this genome assembly adds to current knowledge. For example, reiterating that this is the first chromosome-level Diadematoida assembly, or perhaps explaining with examples how a good reference genome can inform ecological studies. Overall, the significance of this work is not really explained which I think sells this nice work short.

      Minor comments:

      1. Lines 232-233 state the mean coding sequence is 483 bp which seems a bit low, but having examined the peptide fasta file, I believe the average amino acid length is 483 AA, giving an average coding sequence length of ~1449bp. Please confirm and correct if necessary. This would also increase the total # of coding basepairs listed in Table 1.

      2. Lines 66-71: The authors state there are 5 chromosome-level sea urchin assemblies, all of which are camarodonts. However, I believe there are at least three additional chromosome-level assemblies for sea urchins not mentioned: 1) Echinometra sp. EZ (Ketchum et al, 2022; https://academic.oup.com/gbe/article/14/10/evac144/6717576 ) and 2) Paracentrotus lividus (Marletaz et al, 2023; https://www.sciencedirect.com/science/article/pii/S2666979X23000617?via%3Dihub ) and 3) Strongylocentrotus purpuratus (https://www.echinobase.org/echinobase/) Further, P. lividus is not a camarodont, so the text should be corrected accordingly.

      3. Line 106: Please state whether the individual sampled for genome sequencing was male or female.

      4. Line 54: The BUSCO score is reported as 98.1%, but it should be specified whether this is the complete BUSCO score or the single-copy BUSCO score. Ideally, the single-copy and duplicated scores, rather than only the complete score, would be reported so readers have an idea of the duplication rate/haploid-ness of the assembly. Same issue on line 221. Thank you for reporting in Table 1.

      5. Line 56: Text states “27,478 genes including 23,030 protein coding genes” were annotated. Augustus often outputs genes and transcripts, so I am wondering if the authors mean 27K transcripts including 23K genes. If so, the authors should clarify. If not, I think a brief statement of what these additional 4K genes are would be informative

      6. Table 1: Please clarify if “HiFi (X): 21” is referring to 21X coverage. Please correct length of coding sequence to amino acid sequence, and total coding sequence length. Same with Figure 1 panel B.
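The arithmetic behind minor comment 1 (483 amino acids corresponding to roughly 1,449 bp of coding sequence) can be checked with a short sketch. The helper names and the toy FASTA text are hypothetical; the only assumption is the standard three nucleotides per codon, with the stop codon optionally counted.

```python
def protein_lengths(fasta_text):
    """Return the amino acid length of each record in FASTA-formatted text."""
    lengths = []
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            # Ignore a trailing '*' that some tools use to mark the stop codon
            current += len(line.rstrip("*"))
    if current is not None:
        lengths.append(current)
    return lengths


def cds_bp_from_aa(aa_length, include_stop=False):
    """Convert a protein length to a coding sequence length in base pairs."""
    return 3 * aa_length + (3 if include_stop else 0)


# Toy peptide FASTA (hypothetical records)
toy_fasta = ">p1\nMKT*\n>p2\nMKTAY\nLL*\n"
lengths = protein_lengths(toy_fasta)
mean_aa = sum(lengths) / len(lengths)
print(lengths, mean_aa, cds_bp_from_aa(mean_aa))

# The reviewer's figure: a mean of 483 AA implies about 1,449 bp of CDS
print(cds_bp_from_aa(483))
```

Applying the same conversion to the total number of predicted amino acids gives the corrected total coding length that the reviewer expects in Table 1.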

      Reviewer 2. Remi Ketchum

      Minor Edits

      Line 62: Change to “lack a vertebral column” instead of “lack the”.

      Line 64: Change to “sea urchins” instead of “sea urchin”.

      Line 70: Ketchum et al 2022 in GBE produced a chromosome-level genome assembly of Echinometra sp. EZ, so this citation should be included here.

      Line 91: Change to “results in a reduction in coral community complexity”.

      I think that the end of the introduction could use a sentence or two that explicitly states why this genome will be a valuable resource to the scientific community. I think this will also help wrap up the introduction.

      Line 101: Can you provide coordinates? Also, could you remove the word ‘alive’?

      Line 130: I am confused by what you mean by “the sample was then proceeded”.

      Line 181: Was this the same individual that you used for genomic DNA isolation?

      Line 196: Could you please include the specific flags that you used for purge_dups? Did you run Hifiasm with the default parameters?

      Line 240: I would definitely try and include some more sentences in this section. Line 253: Is this section supposed to be here? I think this is meant to go into the methods section.

      The authors could consider adding a comparison table of the genome statistics for the urchin genomes that are currently available. I would also encourage the authors to generate KAT plots to validate that they have successfully collapsed the haplotypes, a common problem with higher heterozygosity.

      Reviewer 3. F. Marlétaz

      I think it would be great to give further detail on the statistics out of the hifiasm contiging step. What are the contig statistics (after the hifiasm step)?

    1. Editors Assessment:

      This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example presents the first whole genome assembly of Dacryopinax spathularia, an edible mushroom-forming fungus that is used in the food industry to produce natural preservatives. Using PacBio and Omni-C data, a 29.2 Mb genome was assembled, with a scaffold N50 of 1.925 Mb and a 92.0% BUSCO score demonstrating the quality (review pushed the authors to provide more detail and QC stats to make this more convincing). These data provide a useful resource for further phylogenomic studies in the family Dacrymycetaceae and investigations on the biosynthesis of glycolipids with potential applications in the food industry.

      This evaluation refers to version 1 of the preprint

    2. Abstract The edible jelly fungus Dacryopinax spathularia (Dacrymycetaceae) is wood-decaying and can be commonly found worldwide. It has also been used in food additives given its ability to synthesize long-chain glycolipids. In this study, we present the genome assembly of D. spathularia using a combination of PacBio HiFi reads and Omni-C data. The genome size of D. spathularia is 29.2 Mb and in high sequence contiguity and completeness, including scaffold N50 of 1.925 Mb and 92.0% BUSCO score, respectively. A total of 11,510 protein-coding genes, and 474.7 kb repeats accounting for 1.62% of the genome, were also predicted. The D. spathularia genome assembly generated in this study provides a valuable resource for understanding their ecology such as wood decaying capability, evolutionary relationships with other fungus, as well as their unique biology and applications in the food industry.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.120), and has published the reviews under the same license. This is part of a thematic series presenting Data Releases from the Hong Kong Biodiversity Genomics consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). These are as follows.

      Reviewer 1. Anton Sonnenberg

      Is the language of sufficient quality? Yes.

      Are all data available and do they match the descriptions in the paper? Yes.

      Is the data acquisition clear, complete and methodologically sound? Yes.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes.

      Figure 1E could be improved by removing the non-repeat sequences from the pie chart or by showing the repeats as a bar plot. That would better visualize the frequency of each type of repeat.

      Reviewer 2. Riccardo Iacovelli

      Is the language of sufficient quality? No.

      There are several typos spread across the text, and some sentences are written in an unclear manner. I provide some suggestions in the attachment.

      Are all data available and do they match the descriptions in the paper?

      Yes, but some of the data shown is rather unclear and/or not supported by sufficient explanation. For example, what is actually Fig. 1C showing? Because the reference in the text (which contains a typo, line 197) refers to something else. What is the second set of stats in Fig. 1B? This other organism is not mentioned at all anywhere in the manuscript.

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      No. The NCBI TaxID of the species sequenced in this work is missing.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. In my opinion, some of the procedures described for the processing of the sample and library prep for sequencing are reported in an unclear way. For example, lines 100-103: no details on RNAse A treatment; how do you define chloroform:IAA (24:1) washes? how much supernatant is added to how much H1 buffer to have the final volume of 6 ml? Another example, lines 180-175: what parameters did you use for EvidenceModeler to generate the final consensus genes model? The weight given to each particular prediction set is important.

      Is there sufficient data validation and statistical analyses of data quality?

      No. While sufficient data validation and statistical analyses have been carried out with respect to DNA sequencing and genome assembly, nothing is reported about DNA extraction and quality. The authors mention several times throughout the text that DNA preps are checked via NanoDrop, Qubit, gel electrophoresis, etc. But none of this is shown in the main body or in the supplementary information. Without this information, it is difficult to assess directly the efficacy of DNA extraction and preparation methods. I recommend including this type of data.

      Additional Comments:

      In this article, the authors report the first whole genome assembly of Dacryopinax spathularia, an edible mushroom-forming fungus that is used in the food industry to produce natural preservatives. In general, I find the data of sufficiently high quality for release, and I do agree with the authors in that it will prove useful to gain further insights into the ecology of the fungus, and to better understand the genetic basis of its ability to decay wood and produce valuable compounds. This can ultimately lead to discoveries with applications in biotech and other industries.

      Nevertheless, during the review process I noticed several shortcomings with respect to unclear language, insufficient description of the experimental procedures and/or results presented, and missing data altogether. These are all discussed within the checklist available in the ReView portal. For minor comments line-by-line, see below:

      1: Dacrymycetaceae should be italicized (throughout the whole manuscript). This follows the convention established by The International Code of Nomenclature for algae, fungi, and plants (https://www.iaptglobal.org/icn). Although not binding, this allows easy recognition of taxonomic ranks when reading an article.
      49: other fungus -> other fungi
      56: photodynamic injury -> UV damage/radiation (photodynamic is used with respect to light-activated therapies etc.)
      60: in food industry as natural preservatives in soft drinks -> in food industry to produce natural preservatives for soft drinks
      68: cultivated in industry as food additives -> cultivated in industry to produce food additives
      69: isolated fungal extract -> the isolated fungal extract
      71: What do you mean by Pacific? It’s unclear.
      71-72: the genomic resource -> genomic data/ genome sequence
      72: I would remove “with translational values”, it is very vague and does not add anything to the statement
      78: genomic resource -> genomic data/ genome sequence
      78-81: this could be rephrased in a smoother manner: e.g. something like “the genomic data will be useful to gain a better understanding of the fungus’ ecology as well as the genetic basis of its wood-decaying ability and…”
      85: fruit bodies -> fruiting bodies
      88-89: Grown hyphae from >2 week-old was transferred -> Fungal hyphae from 2-week old colonies were transferred
      90-91: validated with the DNA barcode of Translation -> assigned by DNA barcoding using the sequence of Translation…
      95: ~ -> Approximately (sentences are not usually started with symbols or numbers)
      101-3: Procedure is not clear enough (see other comments through ReView portal)
      124: for further cleanup the library -> to further clean up the library / for further cleanup of the library
      132: as line 95
      152: as lines 95, 132
      181-5: Insufficient description of methods, see comments through ReView portal
      197: Figure and 1C; Table 2 -> Figure 1C and Table 2
      200: average protein length of 451 bp -> average protein-coding gene length / average protein length of ~150 amino acids
      211: via the fermentation process with applications in the food industry -> via the fermentation process with potential applications in the food industry

      As a fungal biologist myself interested in fungal genomics and biotechnology, I would like to thank the authors for carrying out this work and the editor for the opportunity to review it. I am looking forward to reading the revised version of the manuscript.

      Riccardo Iacovelli, PhD GRIP, Chemical and Pharmaceutical Biology department University of Groningen, Groningen - The Netherlands

    1. Editors Assessment:

      This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example assembles the genome of the milky mangrove Excoecaria agallocha, also known as blind-your-eye mangrove because its toxic milky latex can cause blindness when it comes into contact with the eyes. Living in the brackish water of tropical mangrove forests from India to Australia, these trees are an extremely important habitat for a diverse variety of aquatic species, including the mangrove jewel bug, for whose larvae this tree is the sole food source. Using PacBio HiFi long reads and Omni-C technology, a 1,332.45 Mb genome was assembled, with 1,402 scaffolds and a scaffold N50 of 58.95 Mb. After feedback the annotations were improved, predicting a very high number (73,740) of protein-coding genes. The data presented here provide a valuable resource for further investigation of the biosynthesis of phytochemical compounds in its milky latex, which may have many medicinal and pharmacological properties, as well as increasing the understanding of the biology and evolution of genome architecture in the Euphorbiaceae family and in mangrove species adapted to high levels of salinity.

      This evaluation refers to version 1 of the preprint

    2. Abstract: The milky mangrove Excoecaria agallocha is a latex-secreting mangrove that is distributed in tropical and subtropical regions. While its poisonous latex is regarded as a potential source of phytochemicals for biomedical applications, the genomic resources of E. agallocha remain limited. Here, we present a chromosomal-level genome of E. agallocha, assembled from the combination of PacBio long-read sequencing and Omni-C data. The resulting assembly size is 1,332.45 Mb and has high contiguity and completeness with a scaffold N50 of 58.9 Mb and a BUSCO score of 98.4%. 73,740 protein-coding genes were also predicted. The milky mangrove genome provides a useful resource for further understanding the biosynthesis of phytochemical compounds in E. agallocha.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.119), and has published the reviews under the same license. This is part of a thematic series presenting Data Releases from the Hong Kong Biodiversity Genomics consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). These are as follows.

      Reviewer 1. Minghui Kang

      Is the data acquisition clear, complete and methodologically sound?

      The sample collection site needs to include latitude and longitude data.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Please add the software version number to all the software mentioned in the manuscript. Additionally, if the software uses default parameters, please provide the corresponding description. If specific parameters are used, please indicate the corresponding parameters.

      Additional Comments: This study presents the assembly of an Excoecaria agallocha genome using PacBio HiFi and Omni-C technologies. The assembly exhibits good contiguity and completeness, providing a valuable resource for further understanding the phylogenetic position, evolutionary history, and natural product biosynthesis in Excoecaria agallocha. However, there are still some issues that need to be addressed and modified, including the following points:
      L82 It would be preferable to mention the number of chromosomes and the anchor rate of the chromosome-scale assembly here, as well as the estimated genome size based on K-mer analysis, to further support the accuracy and completeness of the assembly.
      L88 I think the authors need to rearrange the order of the figures, as it is not appropriate for Fig. 1F to appear before Fig. 1A. Please check the results part and arrange the pictures in a reasonable order.
      L117 The sample collection site needs to include latitude and longitude data.
      L187 Please add the software version number to all the software mentioned in the manuscript. Additionally, if the software uses default parameters, please provide the corresponding description. If specific parameters are used, please indicate the corresponding parameters.
      L219 The pseudochromosome scaffolding rate of 86.08% appears to be somewhat low (<90%). The sequences that were not scaffolded onto chromosomes could be a result of untrimmed redundancy in the genome assembly or could indicate some assembly errors.
      L220 Please note that in this instance, Fig. 1C appears before Fig. 1B in the text. I kindly request the author to review and adjust the numbering and arrangement of figures throughout the entire manuscript.
      L223 The quality of gene annotation appears to be significantly lower than the quality of genome assembly (82.1%/98.4%), indicating poor gene annotation accuracy. Please review the accuracy of the HMM model trained by the Augustus software or consider using a more accurate annotation workflow.
      L225 Unclassified repetitive sequences account for over 50% of the total repetitive sequences, which can significantly impact subsequent analyses relying on repetitive sequences. It is recommended to use alternative software, such as The Extensive de novo TE Annotator (EDTA), which provides more accurate classification and utilizes a more comprehensive repetitive sequence library, to validate these results.

      Reviewer 2. Dr. Jarkko Salojarvi

      Is the language of sufficient quality? Yes.
      Are all data available and do they match the descriptions in the paper? Yes.
      Are the data and metadata consistent with relevant minimum information or reporting standards? Yes.
      Is the data acquisition clear, complete and methodologically sound? Yes.
      Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes.
      Is there sufficient data validation and statistical analyses of data quality? Yes.
      Is the validation suitable for this type of data? Yes.
      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes.

    1. Editors Assessment:

      The King Angelfish (Holacanthus passer) is a great example of the Holacanthus angelfishes, which are some of the most iconic marine fishes of the Tropical Eastern Pacific. However, very limited genomic resources currently exist for the genus, and these authors have assembled and annotated the nuclear genome of the species and used it to examine the demographic history of the fish. Nanopore long reads were used to assemble a compact 583 Mb reference with a contig N50 of 5.7 Mb and a BUSCO score of 97.5%. On scrutinising the data, the BUSCO score was high compared to the initial N50s, providing some useful lessons learned on how to get the most out of ONT data. The analysis suggests that the demographic history of H. passer was likely shaped by historical events associated with the closure of the Isthmus of Panama, rather than by the more recent last glacial maximum. This data provides a genomic resource to improve our understanding of the evolution of Holacanthus angelfishes and to facilitate research into local adaptation, speciation, and introgression of marine fishes. In addition, this genome can help improve the understanding of the evolutionary history and population dynamics of marine species in the Tropical Eastern Pacific.

      This evaluation refers to version 1 of the preprint

    2. Abstract: Holacanthus angelfishes are some of the most iconic marine fishes of the Tropical Eastern Pacific (TEP). However, very limited genomic resources currently exist for the genus. In this study we: i) assembled and annotated the nuclear genome of the King Angelfish (Holacanthus passer), and ii) examined the demographic history of H. passer in the TEP. We generated 43.8 Gb of ONT and 97.3 Gb of Illumina reads representing 75X and 167X coverage, respectively. The final genome assembly size was 583 Mb with a contig N50 of 5.7 Mb, which captured 97.5% complete Actinopterygii Benchmarking Universal Single-Copy Orthologs (BUSCOs). Repetitive elements account for 5.09% of the genome, and 33,889 protein-coding genes were predicted, of which 22,984 have been functionally annotated. Our demographic model suggests that population expansions of H. passer occurred prior to the last glacial maximum (LGM) and were more likely shaped by events associated with the closure of the Isthmus of Panama. This result is surprising, given that most rapid population expansions in both freshwater and marine organisms have been reported to occur globally after the LGM. Overall, this annotated genome assembly will serve as a resource to improve our understanding of the evolution of Holacanthus angelfishes while facilitating novel research into local adaptation, speciation, and introgression in marine fishes.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.115), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Iria Fernandez Silva

      Is the language of sufficient quality? Yes, but a "the" is missing before "clingfish" in line 171.

      Additional Comments:

      The genome assembly presented is of high quality, with values of accuracy and completeness on par with chromosome-level assemblies. The study is very well presented in terms of quality of the results and clarity in the presentation of methods and results. An added value is that it allows understanding how different types of data and assemblers interact in improving the assembly quality. I also found it interesting to see how contiguity and completeness are not always correlated, as this assembly has a great BUSCO completeness score in spite of not having the greatest N50 (compared with the most modern assemblies). This is possibly inherent to the type of data (ONT reads) and this information may guide researchers in making decisions over future assembly projects. The demographic analysis is a nice addition to the study; the results are coherent and add interesting information for studying the evolution of reef fishes and the biogeography of the TEP. I would appreciate more detail in the captions of figure 4, particularly those of figure 4D.

      Reviewer 2. Yue Song

      The sequencing and annotation of the King Angelfish genome is impressive and represents a significant addition to the genomic resources for marine fishes. By hybrid assembly, a high-quality genome was provided, and the relationship between the historical dynamics of its population and geological events was further discussed. However, in the section on inferring the demographic history, there is no mention of how the author inferred the mutation rate of this species. In addition, the author obtained 486 contigs throughout the assembly using ONT data combined with short reads. Is it possible to further assemble these contigs to the chromosomal level? Of course, this does not indicate that it must be achieved within this manuscript, but rather suggests the inclusion of additional discussion on methods to further enhance the referential value of this genome. Additional specific comments: (1) Line 86, I guess the author probably meant to say there were 486 contigs, right? (2) Line 294, "gene models", not "gen models". (3) Lines 110-111, I am puzzled by the numbers in parentheses. I don't quite understand what these numbers mean. I haven't seen any explanation in this MS. Did I miss something? (4) If possible, it is recommended to show the phylogenetic relationships between these species in Figure 3.

    1. Editors Assessment: Marsupial species are invaluable for comparative studies due to their distinctive modes of reproduction and development, but there is a shortage of genomic resources to do these types of analyses. To help address that data gap, multi-tissue transcriptomes and transcriptome assemblies have been sequenced and shared for the fat-tailed dunnart (Sminthopsis crassicaudata), a mouse-like marsupial that, due to its ease of breeding, is emerging as a useful lab model. Using ONT nanopore and PacBio long reads and Illumina short reads, 2,093,982 transcripts were sequenced and assembled, and functional annotation of the assembled transcripts was also carried out. Some additional work was required to provide more details on the QC metrics and access to the data, but this was resolved during review. This work ultimately produced a dunnart genome assembly measuring 3.23 Gb in length and organized into 1,848 scaffolds, with a scaffold N50 value of 72.64 Mb. These openly available resources will hopefully provide novel insights into the unique genomic architecture of this unusual species and provide valuable tools for future comparative mammalian studies.

      This evaluation refers to version 1 of the preprint

    2. Abstract: Marsupials exhibit highly specialized patterns of reproduction and development, making them uniquely valuable for comparative genomics studies with their sister lineage, eutherian (also known as placental) mammals. However, marsupial genomic resources still lag far behind those of eutherian mammals, limiting our insight into mammalian diversity. Here, we present a series of novel genomic resources for the fat-tailed dunnart (Sminthopsis crassicaudata), a mouse-like marsupial that, due to its ease of husbandry and ex-utero development, is emerging as a laboratory model. To enable wider use, we have generated a multi-tissue de novo transcriptome assembly of dunnart RNA-seq reads spanning 12 tissues. This highly representative transcriptome is comprised of 2,093,982 assembled transcripts, with a mean transcript length of 830 bp. The transcriptome mammalian BUSCO completeness score of 93% is the highest amongst all other published marsupial transcriptomes. Additionally, we report an improved fat-tailed dunnart genome assembly which is 3.23 Gb long, organized into 1,848 scaffolds, with a scaffold N50 of 72.64 Mb. The genome annotation, supported by assembled transcripts and ab initio predictions, revealed 21,622 protein-coding genes. Altogether, these resources will contribute greatly towards characterizing marsupial biology and mammalian genome evolution.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.118), and has published the reviews under the same license. These are as follows.

      Reviewer 1: Qiye Li

      For the ONT, PacBio and Illumina data for genome assembly, is there any new data that was generated in this manuscript? Are all of the data collected from the same individual? If so, what is the gender of the individual for genome assembly? It will be appreciated to make this information clear to readers. Page 3: I think "Pacific Biosciences CRL" should be modified to "Pacific Biosciences CLR"

      Reviewer 2. Emma Peel.

      Are all data available and do they match the descriptions in the paper?

      No. The figshare link doesn't work, but I'm presuming this is because the paper hasn't been published? Will data be accessioned in the GigaScience Database to ensure accessibility? The Illumina short-read genomic and RNAseq datasets are available through NCBI and match descriptions in the paper. I was unable to find the raw PB and ONT data from [68] that was used to generate the genome assembly. The authors of [68] indicate these datasets are available in supplementary table 3, but if you click through the figshare link in this table the raw data isn't there, nor anywhere else listed in the data availability section. Can the authors please clarify the location of the raw data and update the data availability section of this manuscript accordingly?

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      Yes. Access to the GigaDB accession hasn't been provided, so I am unable to determine if the data and metadata are consistent with minimum information reporting standards according to the GigaDB checklists.

      Is the data acquisition clear, complete and methodologically sound?

      Yes. Some minor clarifications are required, see comments in the PDF. For example, please include detail on how RNA quality was determined (e.g. RIN numbers) and provide more detail regarding method of library preparation, flowcell and instrument used for Illumina sequencing.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes. The only detail lacking is the method of transcript quantification used to determine the top 90% most highly expressed transcripts.

      Is the validation suitable for this type of data?

      Yes. Data validation is suitable, however I would like to see a comparison of v1.1 genome assembly with other marsupial genome assemblies.

      Additional Comments:

      This study is an important addition to marsupial omics resources, and I was excited to see such a comprehensive set of transcriptomes. My main comment is the need to explain and discuss the initial assembly (v1) in the introduction to provide context for the improved assembly. See comments in the attached PDF.

      Annotated paper: https://gigabyte-review.rivervalleytechnologies.comdownload-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvNDg3L2d4LURSLTE3MDE2Njk5NzdfRVAgKDIpLnBkZg==

    1. Abstract: Background. The virome obtained through virus-like particle enrichment contains a mixture of prokaryotic and eukaryotic virus-derived fragments. Accurate identification and classification of these elements are crucial for understanding their roles and functions in microbial communities. However, the rapid mutation rates of viral genomes pose challenges in developing high-performance tools for classification, potentially limiting downstream analyses. Findings. We present IPEV, a novel method that combines trinucleotide pair relative distance and frequency with a 2D convolutional neural network for distinguishing prokaryotic and eukaryotic viruses in viromes. Cross-validation assessments of IPEV demonstrate its state-of-the-art precision, significantly improving the F1-score by approximately 22% on an independent test set compared to existing methods when query viruses share less than 30% sequence similarity with known viruses. Furthermore, IPEV outperforms other methods in terms of accuracy on most real virome samples when using sequence alignments as annotations. Notably, IPEV reduces runtime by 50 times compared to existing methods under the same computing configuration. We utilized IPEV to reanalyze longitudinal samples and found that the gut virome exhibits a higher degree of temporal stability than previously observed in persistent personal viromes, providing novel insights into the resilience of the gut virome in individuals. Conclusions. IPEV is a high-performance, user-friendly tool that assists biologists in identifying and classifying prokaryotic and eukaryotic viruses within viromes. The tool is available at https://github.com/basehc/IPEV. Competing Interest Statement. The authors have declared no competing interest.

      Reviewer 2. Mohammadali Khan Mirzaei

      Yin et al. have developed a new tool to differentiate eukaryotic and prokaryotic viruses. The tool offers a potential benefit to the community, but there are several issues with the contribution in its current form, as discussed below.

      Major issues: The authors should separate their training and testing databases. Ideally, their testing dataset should include a set of previously unseen viruses that have their host experimentally confirmed. In addition, the performance of IPEV should be compared with tools commonly used in the field, including vcontact2: https://doi.org/10.1038/s41587-019-0100-8 and iPHoP: https://doi.org/10.1371/journal.pbio.3002083. Although none of these tools was developed to directly differentiate eukaryotic and prokaryotic viruses, identification of viral taxonomy or host range could lead to the identification of viral type. Moreover, the authors have used multiple approaches for their assessment of the type of viruses, yet it is not clear how they combined the results generated by these approaches in their decisions.

      Minor issues: Please use either phageome or phages instead of phage virome. There are some typos in the text that need to be fixed.

    2. Abstract: Background. The virome obtained through virus-like particle enrichment contains a mixture of prokaryotic and eukaryotic virus-derived fragments. Accurate identification and classification of these elements are crucial for understanding their roles and functions in microbial communities. However, the rapid mutation rates of viral genomes pose challenges in developing high-performance tools for classification, potentially limiting downstream analyses. Findings. We present IPEV, a novel method that combines trinucleotide pair relative distance and frequency with a 2D convolutional neural network for distinguishing prokaryotic and eukaryotic viruses in viromes. Cross-validation assessments of IPEV demonstrate its state-of-the-art precision, significantly improving the F1-score by approximately 22% on an independent test set compared to existing methods when query viruses share less than 30% sequence similarity with known viruses. Furthermore, IPEV outperforms other methods in terms of accuracy on most real virome samples when using sequence alignments as annotations. Notably, IPEV reduces runtime by 50 times compared to existing methods under the same computing configuration. We utilized IPEV to reanalyze longitudinal samples and found that the gut virome exhibits a higher degree of temporal stability than previously observed in persistent personal viromes, providing novel insights into the resilience of the gut virome in individuals. Conclusions. IPEV is a high-performance, user-friendly tool that assists biologists in identifying and classifying prokaryotic and eukaryotic viruses within viromes. The tool is available at https://github.com/basehc/IPEV.

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae018), and has published the reviews under the same license. These are as follows.

      Reviewer 1: Guillermo Andres Rangel-Pineros

      Yin et al described the development and testing of IPEV, a deep-learning-based model that detects and discriminates sequences derived from prokaryotic and eukaryotic viruses in virome datasets. The model was developed using a set of reference viral sequences with known host information. The sequences were represented as sequence pattern matrices that contained values derived from the frequency and order of trinucleotide pairs. These matrices were subsequently used to train a 2D convolutional neural network that generates a 2-value vector for each input sequence, indicating the probability that the sequence corresponds to a prokaryotic or eukaryotic virus. The model was trained and tested using 5-fold cross validation on the reference set, and the authors assessed the robustness of the method using input datasets covering a range of homology and mutation rate values. Finally, the authors applied their model to a gut virome dataset from Shkoporov et al 2019.

      Indeed, IPEV represents a novel method that classifies viral sequences based on the type of host they target (prokaryotic or eukaryotic), and the results presented indicate that it efficiently covers a wide range of sequence lengths (from 100 bp). A model like IPEV provides a focus on eukaryotic viruses that is relatively shallow, in comparison with phages for which a wide range of prediction tools have been developed to date. Nevertheless, there are a few points that the authors need to address, particularly in relation to the robustness of the model:

      Major

      1) I am concerned about the number of reference sequences that were employed to train the model, and it makes me question its general applicability to viromes from any kind of biome. It would be great if the authors incorporated more sequences into their training and validation. Sources of viral sequences such as IMG/VR (https://img.jgi.doe.gov/cgi-bin/vr/main.cgi) and RVDB (https://rvdb.dbi.udel.edu/) could be useful for identifying further sequences and generating a set that covers a much wider range of viral diversity. Perhaps this could also lead to improved performance on the gut datasets.

      2) Even though viral enrichment methods increase the concentration of viral DNA, the presence of contaminant DNA from other microbes in the enriched viral samples is common. Currently, the results do not indicate what the performance of the model would be in the presence of contaminating sequences. I suggest that the authors carry out tests that demonstrate the performance of IPEV when analysing a sample containing microbial contamination (ideally from both prokaryotes and eukaryotes) and demonstrate that IPEV is not prone to wrongly reporting these sequences as viruses.

      3) I find the results of the gut samples interesting and appropriate for the scope of IPEV. However, if IPEV is meant to be a general-purpose tool for virome analysis, it would be ideal if the authors provided results demonstrating the performance of the tool with samples from other biomes. For example, the authors could analyse datasets from the TARA Oceans project (e.g., 10.1016/j.cell.2019.03.040), some of which have already been assembled (https://www.ebi.ac.uk/ena/browser/view/PRJEB22493).

      4) There are several instances in the manuscript where the authors indicate the existence of significant differences between metrics measured to compare the performance of tools (e.g., line 326: “which was significantly higher than the mean AUC values of …”), but there is no mention of statistical analyses conducted to reach those conclusions (except for the Wilcoxon rank-sum test in line 305). Please provide information on statistical tests conducted to identify the significant differences.

      Minor

      1) There is a reference missing in line 37.
      2) In the sentence between lines 41-44, it is not clear what you are referring to with “identification of viral sequences”. Are you referring to viral vs non-viral, or to host identification?
      3) Line 50: you mean “identification” or “differentiation”?
      4) The two sentences between lines 49 – 52 seem redundant. I would suggest rewriting these into a single sentence.
      5) Line 65: the latest version of ICTV taxonomy has 11,273 species. Please update this number.
      6) Line 67: there is a newer version of VirSorter (VirSorter2), which has an expanded scope in comparison with the older version. Please modify the text to include the most up-to-date version of this tool.
      7) There are some more tools with a varied range of strategies for viral prediction that are widely known among the community, which I feel should be mentioned in the introduction (e.g., VIBRANT, DeepVirFinder, PPR-Meta, etc). Even though none of these were explicitly designed for prediction of eukaryotic viruses, it’d be worth commenting on them.
      8) Indicate the version of Virus-Host DB used, and the version or date when the viral data was retrieved from NCBI.
      9) Line 124: do you mean 10 samples or 10 adults? If it’s the latter, please correct the sentence.
      10) Line 130: by “genome sequences” are you referring to the assembled viral contigs? In that case, please clarify as it is currently ambiguous.
      11) Tables 1 and 2, perhaps consider presenting these results as plots? I feel that the tables are rather hard to process.
      12) Line 274: This is a rather old reference, are you sure the error rate for PacBio is still this high? I would suggest looking at more up-to-date references.
      13) Line 279: replace “base insert or delete” with “insertions or deletions”.
      14) Table 3: Indicate the length range of the analysed sequences in the header.
      15) The section regarding the performance on functional proteins seems to include information that should be split between methods and results. Please modify accordingly.
      16) Please italicise names of viral taxa wherever they are mentioned in the manuscript (e.g., Tubulavirales and Timlovirales in Line 300).
      17) Line 320: This sounds as if the authors had conducted the experiments to collect the gut virome data. Rewrite to make it clear that these data were retrieved from a previous study.
      18) Line 331: Based on which observation did you reach this conclusion?
      19) Line 368: Wasn’t HTP developed for addressing a similar question? Please clarify.
      20) Line 409-410: The way the sentence is written seems to indicate that plant viruses can also infect human cells and microorganisms. Please rewrite to make it clearer.
      21) Regarding the tool’s text output, I would suggest modifying it to make it easier to parse (for example, leaving it as a tabular .csv file), and currently the header does not seem to accurately describe the contents of the file.

      Re-review: Yin et al described the development and testing of IPEV, a deep-learning-based model that detects and discriminates sequences derived from prokaryotic and eukaryotic viruses in virome datasets. The model was developed using a set of reference viral sequences with known host information. The sequences were represented as sequence pattern matrices that contained values derived from the frequency and order of trinucleotide pairs. These matrices were subsequently used to train a 2D convolutional neural network that generates a 2-value vector for each input sequence, indicating the probability that the sequence corresponds to a prokaryotic or eukaryotic virus. The model was trained and tested using 5-fold cross validation on the reference set, and the authors assessed the robustness of the method using input datasets covering a range of homology and mutation rate values. Finally, the authors applied their model to a gut virome dataset from Shkoporov et al 2019, and marine virome datasets from Gregory et al 2019. Indeed, IPEV represents a novel method that classifies viral sequences based on the type of host they target (prokaryotic or eukaryotic), and the results presented indicate that it efficiently covers a wide range of sequence lengths (from 100 bp). A model like IPEV brings a focus to eukaryotic viruses, which have so far received relatively shallow attention in comparison with phages, for which a wide range of prediction tools have been developed to date. In my opinion, the authors satisfactorily addressed the comments and suggestions made in the first round of review. I only have a few final suggestions to finalise the manuscript and have it ready for publication: 1) The authors include some text in the Discussion section (paragraph from line 423 to line 436, and paragraph from line 437 to 448) that, in my opinion, would fit better in the Results section. 
I suggest the authors include these in the Results section, and then in the Discussion comment how those results compare to other methods and what are their implications. 2) I would suggest modifying the sentence in line 42 like this: "Nonetheless, it is essential to note that enriched sample approaches carry the risk of losing valuable host or environmental information [8], potentially leading to inaccurate virus host identification and constraining subsequent analyses." 3) In the sentence starting in line 392, instead of "During" use "For".
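      The sequence representation summarised in this re-review (matrices built from the frequency and order of trinucleotide pairs) can be illustrated with a small sketch. This is not the IPEV implementation: the review does not give the exact definition, so the code below assumes that a "pair" means two directly adjacent, non-overlapping trinucleotides read at every offset, and that the matrix holds normalised pair frequencies. The function and variable names are hypothetical.

```python
from itertools import product

BASES = "ACGT"
# All 64 trinucleotides in a fixed order, plus a lookup from trinucleotide to row/column index.
TRINUCS = ["".join(p) for p in product(BASES, repeat=3)]
INDEX = {t: i for i, t in enumerate(TRINUCS)}


def trinucleotide_pair_matrix(seq):
    """Return a 64x64 matrix of ordered trinucleotide-pair frequencies.

    Assumption (not taken from the manuscript): entry [a][b] is the fraction of
    positions i where seq[i:i+3] == a is immediately followed by seq[i+3:i+6] == b.
    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    n = len(TRINUCS)
    counts = [[0] * n for _ in range(n)]
    total = 0
    for i in range(len(seq) - 5):
        first = seq[i:i + 3]
        second = seq[i + 3:i + 6]
        if first in INDEX and second in INDEX:
            counts[INDEX[first]][INDEX[second]] += 1
            total += 1
    if total == 0:
        return [[0.0] * n for _ in range(n)]
    return [[c / total for c in row] for row in counts]


matrix = trinucleotide_pair_matrix("ACGTACGT")
```

      A fixed-size matrix like this can be computed for fragments of any length, which is one reason such encodings suit 2D convolutional networks; how IPEV actually orders and weights the pairs should be checked against the paper.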

    1. Abstract: We present 4,157 whole-genome sequences (Korea4K) coupled with 107 health check-up parameters as the largest whole genomic resource of Koreans. Korea4K provides 45,537,252 variants and encompasses most of the common and rare variants in Koreans. We identified 1,356 new geno-phenotype associations which were not found by the previous Korea1K dataset. Phenomics analyses revealed 24 genetic correlations, 1,131 pleiotropic variants, and 127 causal relationships from Mendelian randomization. Moreover, the Korea4K imputation reference panel showed a superior imputation performance to Korea1K. Collectively, Korea4K provides the most extensive genomic and phenomic data resources for discovering clinically relevant novel genome-phenome associations in Koreans. Competing Interest Statement: S.J., Y. J., H. R., Y.J.K., C.K, Yeonkyung K., Younghui K., Y. J. W., and B. C. K. are employees and Jong B. is the CEO of Clinomics Inc. The authors declare no other competing interests.

      Reviewer 2: Taras K Oleksyk, Ph.D.

      Comments to Author: The authors contribute 4,157 whole-genome sequences (Korea4K) coupled with 107 health check-up parameters as the largest genomic resource of the Korean Genome Project. It has likely characterized most of the common and very common genetic variants with commonly measured phenotypes for Koreans. It also discusses its applicability not only for the Korean population but also for other East Asian populations, and possibly to other national genome projects as well. This work makes a significant contribution of data that can be used in future genome-wide association studies in the context of the Korean population. The manuscript appears to cover a lot of ground: from methodological issues to the real-world applications of the dataset in healthcare. The authors adopt innovative methods like GREML, which have been reported to have higher accuracy compared to older methods. The authors are transparent about the limitations of their study, such as sample size and lack of sufficient data for rare diseases. They also acknowledge that phenomics analyses were not powerful enough for novel discoveries, indicating areas for future research. However, given the increasing importance of genomic data in healthcare and personalized medicine, the paper appears to be highly relevant. While the paper is well formulated, there are some issues that need to be addressed before it is accepted for publication. See below: 1. You referred to the UK Biobank data for some of your analyses. Were there any limitations or caveats in comparing your dataset to the UK Biobank? What about other national genomic projects that are out there? How transferable do you think the Korea4K dataset would be to studies focusing on other populations outside East Asia? 2. Could you expand on any ethical considerations that were taken into account, especially in terms of data privacy and informed consent? 3. How was the data cleaned and preprocessed, and were there any missing data points? 
If so, how were these handled? What number of reads (before and after QC), and other quality metrics do the sequenced reads have? What was the average coverage across the genome? What was the read length? 4. How did you ensure the quality of the genomic data collected from different sources such as Korea1K and public data archives? The paper mentions mitigating batch effects through allele balance and manual checks. Could you provide more details on the methodology behind these checks and their efficiency? 5. Could you provide more information about the control group? Was it matched for age, sex, or other variables? How was the sample size determined, and does it provide enough statistical power to support your conclusions? 6. You mentioned that the statistical power of your study will increase with more participants. Would this have implications for other national genomes that are making similar projects? Please elaborate on how your sensitivity analysis could apply to other populations outside Korea. 7. The paper acknowledges that the sample size was not large enough to detect weak association signals. Have you considered statistical methods that can boost power in small samples? 8. Could you provide more details on the 107 clinical parameters used for the Korea4K phenome dataset? Were these parameters standardized across the different clinics and hospitals? 9. What criteria were used for initial sample filtering, particularly for excluding kinship? Could you clarify the steps taken to identify and filter the 64,301,272 SNVs and 8,776,608 Indels? How did you correct for batch effects arising from different Illumina NGS platforms and library preparations? Did you use specialized SNV calling software, or only GATK? 10. How were allele frequencies calculated and what considerations were made to interpret their biological significance? 
You mention that more than half of the singleton and doubleton variants were newly discovered. Could you elaborate on the methodology used to confirm these as novel variants? 11. The section on phenotypic correlations mentions 2,274 trait-trait relationships. How would you address the potential for population stratification affecting the results of your genetic and phenotypic correlations? How did you account for multiple comparisons in determining significant genetic correlations, and what corrections were applied to maintain the FDR? What measures were taken to ensure that the traits considered in this section were not subject to confounding and/or collider biases? 12. In your findings, Waist-Creatine showed opposite directions for genetic and phenotypic correlations. Could you elaborate on the potential implications or causes of this discrepancy? 13. Were there any other surprising or unexpected correlations, and what are their potential implications? 14. You mentioned that phenomics analyses were not powerful enough for novel discoveries. Could you elaborate more on what would be needed to make them more effective? 15. For the future implications, in terms of healthcare and personalized medicine, what do you see as the most immediate applications of the Korea4K dataset?

      Re-review: Thank you for providing extensive answers to my questions. I am happy to recommend your paper for publication in its revised form.

    2. Abstract: We present 4,157 whole-genome sequences (Korea4K) coupled with 107 health check-up parameters as the largest whole genomic resource of Koreans. Korea4K provides 45,537,252 variants and encompasses most of the common and rare variants in Koreans. We identified 1,356 new geno-phenotype associations which were not found by the previous Korea1K dataset. Phenomics analyses revealed 24 genetic correlations, 1,131 pleiotropic variants, and 127 causal relationships from Mendelian randomization. Moreover, the Korea4K imputation reference panel showed a superior imputation performance to Korea1K. Collectively, Korea4K provides the most extensive genomic and phenomic data resources for discovering clinically relevant novel genome-phenome associations in Koreans.

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae014), and the reviews have been published under the same license. These are as follows.

      Reviewer 1: Pui-Yan Kwok

      Comments to Author: This manuscript describes the second phase of the Korean Genome Project (KGP) with 4,157 sets of whole-genome data (designated Korea4K). After error correction and sequencing data curation, the whole-genome sequencing (WGS) data from 3,614 unrelated individuals were used in the analyses. They also analyzed 107 types of clinical traits from 2,685 healthy participants' health check-up reports over a 4-year period (2016-2019). They performed a range of analyses and claimed that this new data performed better than Korea1K, the first phase KGP dataset, in a number of ways. A larger Korean dataset adds to the global genome resource and provides further insights into the Korean population. However, the results are mostly descriptive and serve as a catalog without significant new insights. The results are as expected (Korea4K is a better imputation reference panel than Korea1K, new variants are identified in the population, new variants are found in association with various phenotypes, etc.) and this dataset is sufficiently large to capture all the common variants found in the homogeneous Korean population. The authors should address several issues: 1. The use of whole genome sequencing data in GWAS. The Bonferroni correction the authors used in their analysis was that for SNP array studies. They must do a formal correction with the many more variants found in WGS data and use a statistically sound correction for their analysis. The severe penalty for multiple testing using WGS data for GWAS is why few such studies have been done. I suspect that many of the associations will not reach statistical significance after proper correction, as the dataset is quite small for most traits under study. 2. The authors should use the new genome references for their variant calling (T2T reference and the Human Pangenome Reference), as the GRCh38 is no longer the gold standard and the results will be quite different with the most up-to-date references. 
Using the best human genome reference will make Korea4K more valuable. 3. The authors should clarify how many of the participants who contributed clinical data are unrelated.

      Re-review:

      The authors made attempts to address the issues raised previously but did not do so adequately. 1. Using the same GWAS cutoff of P <5E-8 and adding the FDR correction (Benjamini-Hochberg) does not solve the problem of multiple testing using whole genome sequencing data (where there are orders of magnitude more variants than those on typical SNP arrays) for the study. With clinical data available for only 2,262 samples, each phenotype under study will have a very small number of individuals, making the result of 2,314 variants from 30 clinical traits with significant association highly suspect. The authors should consult statisticians with experience using whole genome sequencing data for association studies to come up with a better statistical study design. 2. The authors acknowledge that using the newer reference would be a good approach, but their stated reason for not doing so, that the "T2T reference lacks enough annotation data", is not an adequate response. The point is to have the best variant calls for the Korea4K data; annotation is irrelevant until variants with significant association are identified. Claiming that they will do so in future versions of the project diminishes the significance of the current manuscript.
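      The multiple-testing concern raised in both rounds can be made concrete with a short sketch. It compares the conventional array-era cutoff of 5E-8 (roughly a Bonferroni correction for about one million independent tests) with a Bonferroni cutoff computed from the 45,537,252 variants reported in the Korea4K abstract, and includes a plain Benjamini-Hochberg step-up procedure for reference. The number of variants the authors actually tested is not stated in the review, so using the full variant count is an assumption and gives a conservative bound; variants in linkage disequilibrium are not independent, so the effective number of tests would be lower. Function names are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that controls family-wise error at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking hypotheses rejected at FDR level q.

    Step-up procedure: find the largest rank k with p_(k) <= k * q / m,
    then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            max_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


ARRAY_ERA_CUTOFF = 5e-8   # cutoff quoted in the re-review
N_VARIANTS = 45_537_252   # variant count from the Korea4K abstract (assumed upper bound on tests)

wgs_cutoff = bonferroni_threshold(0.05, N_VARIANTS)
ratio = ARRAY_ERA_CUTOFF / wgs_cutoff
print(f"WGS Bonferroni cutoff: {wgs_cutoff:.3e} ({ratio:.1f}x stricter than 5e-8)")
```

      Under these assumptions the Bonferroni cutoff falls to roughly 1.1E-9, so an association with a p-value between that value and 5E-8 would pass the array-era threshold but not the stricter one.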

    1. Abstract: Most of available reference genomes are lack of the sequence map of sex-limited chromosomes, that make the assemblies uncompleted. Recent advances on long reads sequencing and population sequencing raise the opportunity to assemble sex-limited chromosomes without the traditional complicated experimental efforts. We introduce a computational method that shows high efficiency on sorting and assembling long reads sequenced from sex-limited chromosomes. It will lead to the complete reference genomes and facilitate downstream research of sex-limited chromosomes. Competing Interest Statement: The authors have declared no competing interest.

      Reviewer 3. Arang Rhie

      Comments to Author: 1. In the introduction, add recent marker based graph phasing algorithms in long-reads, such as hifiasm trio and verkko trio mode after the T2T-Y. They are different from trio-binning, which tries to phase the reads upfront. Graph based phasing is using markers to determine haplotype specific paths to traverse. a. T2T-Y chromosome should be referencing Rhie et al., Nature 2023. Verkko is a successor of the manual efforts taken in T2T-Y, which should be also noted in the introduction. b. Reference for sexPhase program is still missing. Also, some rephrasing of the sentence is needed, as the way it is currently written can easily be misread as implying that sexPhase was part of the methods used in the assembly of the T2T-Y. 2. There are other approaches for phasing genomes taken in plants, for example the polyploid potato phasing using many siblings of the child by Mari et al. bioRxiv 2022. 3. "But only one male and one female could suffer from sampling error" - this part is unclear. Please clarify. 4. Reference for the mason_simulator, badread software is missing. 5. Provide the accession (HG02982) for the "African human Y" in the main text. 6. I appreciate that the authors compared assemblies to T2T-Y as I requested before. However, fundamentally, mapping to T2T-Y and comparing length of each sequence classes is comparing apples to oranges, particularly in the heterochromatic region and ampliconic region of the Y. It is known to have variable copy numbers and size differences between two individuals. Frequent inversions have been reported in the ampliconic regions across different Y haplogroup. The number, size, and distribution of the repeat arrays composing the heterochromatic region has been shown to vary among different Y haplogroups in Hallast et al., Nature 2023. This can be also seen in Fig. 
3c; the overall depth of the flow sorting in the heterochromatic region is below 1 - indicating the Yqh is shorter than T2T-Y, as it is in Fig. 3b. To make the benchmark legit, the authors should compare SRY and the flow sorting method using samples from the same individual. HG02982 and HX1 are presumably having very different sequence compositions given the diverged population history (African vs. Asian). Comparing total length of the assembled region against a 3rd different Y haplogroup (HG002Y) makes things more complicated, especially on regions that are known to vary a lot. If the authors think flow sorting based method needs to be compared, it should be benchmarked on the same individual to make an apple-to-apple comparison. I do agree results from read sorting (i.e. portion of reads sequenced from non-Y chromosomes in SRY vs. flow-sorting) is an important finding. However, I'd still argue comparing assemblies from the two different Y haplogroups is a stretch. The authors could have performed the same assembly length comparison on the T2T-Y using results from their SRY sorted reads with Verkko of HG002 vs. Verkko assembly using trio-binned markers. 7. In the section where assemblies are compared, the authors point to Table 1, which contains results from HG01109. HG01109 has never been mentioned before. I thought the authors were comparing assemblies from SRY sorted reads of HX1? I am not sure why the authors suddenly added a 3rd PUR genome with no context. Was this a mistake? Add results from HX1 to Table 1. 8. Please add divider lines in Table 1 between All / Ampliconic / X-degenerate / X-transposed / PAR / Het / Others. It is hard to see which rows belong to which category. 9. The last result section where authors compare results from Verkko, it is unclear how the verkko assembly was run. The authors say "default option", and later "in trio mode" in the methods. Did the authors collect parental reads from HG002 (HG003 and HG004)? 
How was "trio mode" performed? Did the authors use trio binning to sort the reads, then run Verkko? Or used the homopolymer compressed parental kmers and used that in the Rukki step of Verkko (and this should be benchmarked)? Was the HG002 trio assembly taken from Rautiainen et al. paper? Please clarify and add the missing parts to the main text and methods. 10. Related to the above section, it is hard to see in Fig. 4a the "two approximately 1 Mb contigs aligning to the same region of the Y chromosome". An enlarged inset of the dotplot may be helpful. Also, add legends and scale to the X and Y axis of the dotplots. 11. Note there is a mis-assembly reported on T2T-Y palindrome P5 (https://github.com/marbl/CHM13-issues/blob/main/v2.0_issues.bed), which the entire P5 should be inverted. I don't see this in the dotplots of Fig. 4. 12. In the discussion, the authors mention results from the 10 trios that have been removed from the previous results. Please add the 10 trio results to the main text if it was a mistake, or remove the irrelevant results from the Discussions and Supp. Tables. 13. The authors argue that the suboptimal performance of SRY in the PAR is caused by the restricted data types. I thought it was caused by the lower density of the markers? The PAR parental marker density was very similar to that of autosomes, with stretches of runs of homozygosity, presumably to maintain enough homology for recombination. What was the marker density in the PAR? Was it below their 7 kmer / 1kb? 14. The authors mentioned there are no ZW genomes available to test SRY. There is a Zebra finch trio (ZW, female, bTaeGut2) and a male sample (ZZ, male, bTaeGut1) available with HiFi of the child (bTaeGut2) and Illumina of all the genomes from the Vertebrate Genomes Project (Rhie et al., Nature, 2021). 
Perhaps the authors could apply SRY on this individual, and compare the W chromosome results to what has been released on https://www.genomeark.org/vgp-all/Taeniopygia_guttata.html.

      Re-review: The authors have addressed most of my concerns. The revised manuscript reads much better than before. Regarding my last comment and response from the authors about the W chromosome, I was hoping to see comparable coverage of the W chromosome to the reference, as a proof of principle that SRY could be applied to non-human, highly diverged genomes. The assembly looks very fragmented though. Was it only the similarity to the Z chromosome that caused the fragmentation? Are there no other factors contributing to the discontinuity of the W chromosome? A few minor comments below to the revised version: 1. Please indicate which genome was compared in the legend of Supp. Table 5. 2. When using et al notations, please use the last name. Mari et al should be Serra Mari et al., Mikko et al should be Rautiainen et al. Also, Serra Mari et al is now published in Genome Biology: https://doi.org/10.1186/s13059-023-03160-z. Please update the reference. 3. There are a few grammar corrections to make.

    2. Abstract: Most of available reference genomes are lack of the sequence map of sex-limited chromosomes, that make the assemblies uncompleted. Recent advances on long reads sequencing and population sequencing raise the opportunity to assemble sex-limited chromosomes without the traditional complicated experimental efforts. We introduce a computational method that shows high efficiency on sorting and assembling long reads sequenced from sex-limited chromosomes. It will lead to the complete reference genomes and facilitate downstream research of sex-limited chromosomes.

      Reviewer 2. Shilpa Garg

      Comments to Author: The SRY method, developed and evaluated for sorting long reads of sex-limited chromosomes, has shown promise in effectively identifying and sorting sequences based on sex-specific markers, particularly the Y chromosome. These sorted long reads are then utilized for genome assembly. Additionally, the SRY method can be used to select Y chromosome contigs from a male individual's whole genome assembly. Overall, the success of SRY in sorting and assembling long reads of sex-limited chromosomes highlights its potential as an alternative to experimental methods for studying sex-specific genomic regions. Here are some comments for further improvement of the manuscript: 1) The authors may want to consider presenting a table for standard evaluation metrics (k-mer or alignment-based). See Garg 2021 (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02328-9). 2) Adding a few important genes that are medically relevant and assembled properly may further add value to the work.

    3. Abstract: Most of available reference genomes are lack of the sequence map of sex-limited chromosomes, that make the assemblies uncompleted. Recent advances on long reads sequencing and population sequencing raise the opportunity to assemble sex-limited chromosomes without the traditional complicated experimental efforts. We introduce a computational method that shows high efficiency on sorting and assembling long reads sequenced from sex-limited chromosomes. It will lead to the complete reference genomes and facilitate downstream research of sex-limited chromosomes.

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae015), and the reviews have been published under the same license. These are as follows.

      Reviewer 1: Zuyao Liu, Ph.D

      Comments to Author: The authors have introduced a novel bioinformatic approach for sex chromosome assembly, addressing a persistently challenging problem in genomics. This method harnesses the full potential of whole-genome resequencing data without necessitating supplementary experimental procedures, rendering it applicable to a wide array of non-model species. Notably, the method exhibits robustness when applied to human data, surpassing established techniques such as flow-sorting and trio-binning. While the manuscript exhibits promise, several key aspects warrant refinement and elucidation to bolster its consideration for publication in GigaScience. 1. Language Polishing: A degree of language refinement is advisable to enhance the overall clarity and professionalism of the manuscript. 2. Y Chromosome Assembly Discrepancy: The authors should acknowledge and provide an explanation for the substantial difference between the length of the latest Y chromosome assembly from T2T (~62Mb) and the assembly from SRY with Verkko (~23Mb), as detailed in Table 1. 3. Y Chromosome Completeness: In cases where the Y chromosome assembly is incomplete, the inclusion of a figure or table delineating the proportion that SRY can recover in distinct regions of the Y chromosome would be beneficial. This could facilitate a comparative analysis of the method's efficacy across different regions. 4. Figure 4 Clarity: It is imperative to label the coordinates on both the X and Y axes in Figure 4 to enhance clarity. While Figure 4 suggests that the assembly from SRY is complete compared to T2T-CHM13, the total length of the SRY assembly (approximately 23Mb) should be clearly reconciled with this observation. 5. Table 1 Organization: The organization of Table 1 should be improved to enhance readability and comprehensibility. 6. 
MSK-Based Read Filtering: Authors should explicitly address the potential exclusion of reads from Y regions with lower than average MSK, especially in species with both young and old parts on Y chromosomes. If possible, provide recommendations or strategies for rescuing such reads. 7. Simulation for species with young sex chromosomes: It is essential to conduct additional simulations for testing the efficiency of isolating Y reads for species with young sex chromosomes. This analysis should consider the variation between X and Y chromosomes, aiding researchers in evaluating the method's suitability for their specific study organisms.

      Addressing these points will further strengthen the manuscript's scientific rigor and its suitability for publication in GigaScience.

      Re-review: After reading the revised article, the questions I had previously posed were answered. I am very interested in this SRY method and believe it is also an important part of sex chromosome research. From my personal point of view, it is not easy to collect trio data for most species except a few, but it is relatively easy to collect Hi-C data. It would be helpful if the authors could also compare the results of SRY with HiFi reads against those of hifiasm (Hi-C phased) to help people choose the right tool for sex chromosome assembly. However, this is not necessary, because SRY has achieved a very good result in humans. Overall, the data and results are convincing.

    1. ABSTRACT: As adoption of nanopore sequencing technology continues to advance, the need to maintain large volumes of raw current signal data for reanalysis with updated algorithms is a growing challenge. Here we introduce slow5curl, a software package designed to streamline nanopore data sharing, accessibility and reanalysis. Slow5curl allows a user to fetch a specified read or group of reads from a raw nanopore dataset stored on a remote server, such as a public data repository, without downloading the entire file. Slow5curl uses an index to quickly fetch specific reads from a large dataset in SLOW5/BLOW5 format and highly parallelised data access requests to maximise download speeds. Using all public nanopore data from the Human Pangenome Reference Consortium (>22 TB), we demonstrate how slow5curl can be used to quickly fetch and reanalyse signal reads corresponding to a set of target genes from each individual in large cohort dataset (n = 91), minimising the time, egress costs, and local storage requirements for their reanalysis. We provide slow5curl as a free, open-source package that will reduce frictions in data sharing for the nanopore community: https://github.com/BonsonW/slow5curl Competing Interest Statement: I.W.D. manages a fee-for-service sequencing facility at the Garvan Institute of Medical Research that is a customer of Oxford Nanopore Technologies but has no further financial relationship. H.G., J.M.F. and I.W.D. have previously received travel and accommodation expenses from Oxford Nanopore Technologies. The authors declare no other competing financial or non-financial interests.

      Reviewer 3. Guillermo Dufort y Alvarez

      This paper introduces slow5curl, a software tool that extends slow5lib, a previous tool developed by the authors. The tool allows users to retrieve raw nanopore sequencing reads from remote BLOW5 files, a novel format that has several advantages over FAST5 and POD5, which are the most widely used formats for storing raw nanopore sequencing data. BLOW5 is not yet a standard format, but this tool could encourage its adoption and the development of similar tools in the future. The paper is well written, clear, and concise, and the tool is tested on various scenarios. The GitHub repository provides clear instructions and examples for building and using the tool. My comments to the authors are: Major 1. I am concerned about the main use case of the tool, which is to obtain a subset of raw nanopore reads that align to a specific region (e.g., a gene), in order to re-basecall them with a new software tool. This assumes that the alignment region of the original basecall is consistent with the new basecall, which may not be true. The new basecall sequences may align better to a different region, and some sequences that were not retrieved may align well to the desired region. This affects the precision and recall of the process. I would like the authors to address this issue, by either providing evidence that this is rare, or explaining why the tool is still useful despite this limitation. 2. The tool depends on the availability of a BAM file for the raw reads, which is uploaded along with the BLOW5 file and its index. In the section Fetching reads from a large cohort, the authors claim that storing the raw nanopore data with its index reduces the size by 29.7% compared to FAST5. However, they do not consider the size of the BAM file, which is required for the main use case. I would like the authors to address this, by either reporting the size of the BAM files, or justifying why their size is irrelevant for this comparison.
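      The precision and recall concern in Major comment 1 can be quantified with a simple set comparison. The sketch below assumes two lists of read IDs are available: the reads fetched because their original basecalls aligned to the target region, and the reads whose new basecalls align there. The latter would require re-basecalling the full dataset once, so this is an evaluation aid rather than part of the slow5curl workflow. The function name and read IDs are hypothetical.

```python
def retrieval_precision_recall(fetched_ids, on_target_after_rebasecall_ids):
    """Compare reads fetched via the old alignments with reads that align
    to the target region after re-basecalling.

    precision: fraction of fetched reads that still align to the region.
    recall:    fraction of reads aligning to the region after re-basecalling
               that were actually fetched.
    """
    fetched = set(fetched_ids)
    truth = set(on_target_after_rebasecall_ids)
    true_positives = len(fetched & truth)
    precision = true_positives / len(fetched) if fetched else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy example with made-up read IDs.
fetched = ["read1", "read2", "read3", "read4"]
on_target = ["read2", "read3", "read4", "read5", "read6"]
precision, recall = retrieval_precision_recall(fetched, on_target)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

      Values close to 1 for both metrics on a representative sample would support the claim that region-based fetching loses few relevant reads.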

      Minor 1. In section RESULTS, in line two, delete the repeated word simple from "simple BLOW5 simple".

      Re-review. The authors correctly addressed each one of the comments I made. From my side, no further changes are needed for publication. Great work.

    2. As adoption of nanopore sequencing technology continues to advance, the need to maintain large volumes of raw current signal data for reanalysis with updated algorithms is a growing challenge. Here we introduce slow5curl, a software package designed to streamline nanopore data sharing, accessibility and reanalysis. Slow5curl allows a user to fetch a specified read or group of reads from a raw nanopore dataset stored on a remote server, such as a public data repository, without downloading the entire file. Slow5curl uses an index to quickly fetch specific reads from a large dataset in SLOW5/BLOW5 format and highly parallelised data access requests to maximise download speeds. Using all public nanopore data from the Human Pangenome Reference Consortium (>22 TB), we demonstrate how slow5curl can be used to quickly fetch and reanalyse signal reads corresponding to a set of target genes from each individual in large cohort dataset (n = 91), minimising the time, egress costs, and local storage requirements for their reanalysis. We provide slow5curl as a free, open-source package that will reduce frictions in data sharing for the nanopore community: https://github.com/BonsonW/slow5curl

      Reviewer 2. Yunfan Fan

      Comments to Author: In this manuscript, the authors demonstrate a highly streamlined method for downloading targeted subsets of raw ONT electrical signals, for re-analysis. In my view, this will be a highly useful tool for researchers working with public nanopore data, and I hope to see its widespread adoption. The benchmarks are well-described in the manuscript, and the code is publicly available and well-documented. I have no other notes or suggestions for the authors.

    3. ABSTRACT As adoption of nanopore sequencing technology continues to advance, the need to maintain large volumes of raw current signal data for reanalysis with updated algorithms is a growing challenge. Here we introduce slow5curl, a software package designed to streamline nanopore data sharing, accessibility and reanalysis. Slow5curl allows a user to fetch a specified read or group of reads from a raw nanopore dataset stored on a remote server, such as a public data repository, without downloading the entire file. Slow5curl uses an index to quickly fetch specific reads from a large dataset in SLOW5/BLOW5 format and highly parallelised data access requests to maximise download speeds. Using all public nanopore data from the Human Pangenome Reference Consortium (>22 TB), we demonstrate how slow5curl can be used to quickly fetch and reanalyse signal reads corresponding to a set of target genes from each individual in a large cohort dataset (n = 91), minimising the time, egress costs, and local storage requirements for their reanalysis. We provide slow5curl as a free, open-source package that will reduce frictions in data sharing for the nanopore community: https://github.com/BonsonW/slow5curl
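
      The index-based fetch described in the abstract above can be pictured as a set of HTTP byte-range requests. The following is a minimal sketch only, assuming a hypothetical index that maps each read ID to a byte offset and record length in the remote file. It is not slow5curl's implementation or API; the names `ReadIndex`, `range_header`, `plan_requests`, and the example offsets are all illustrative.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory index: read ID -> (byte offset, record length in bytes).
ReadIndex = Dict[str, Tuple[int, int]]


def range_header(index: ReadIndex, read_id: str) -> Dict[str, str]:
    """Build an HTTP Range header covering exactly one read's record."""
    offset, length = index[read_id]
    # HTTP byte ranges are inclusive at both ends.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def plan_requests(index: ReadIndex, read_ids: List[str]) -> List[Dict[str, str]]:
    """One range request per requested read; these could be issued in parallel."""
    return [range_header(index, rid) for rid in read_ids]


# Illustrative offsets only, not taken from any real BLOW5 file.
example_index: ReadIndex = {"read_a": (1000, 250), "read_b": (1250, 400)}
headers = plan_requests(example_index, ["read_b", "read_a"])
```

      Because each request names only the bytes of one record, the amount of data transferred scales with the number of reads requested rather than with the size of the whole file.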

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae016), and has published the reviews under the same license. These are as follows.

      Reviewer 1: Jan Voges

      Comments to Author: The manuscript provides a detailed overview of the proposed technology, with an emphasis on reproducibility through precise software version and command line documentation. Although slow5curl is a rather simple implementation of curl-based streaming for nanopore data, it is extensively evaluated, which makes its value to the nanopore community clear. I do have a few minor comments:

      Introduction: The importance of preserving raw signal data needs to be more clearly articulated. There is a view within the community that reads that have undergone high-accuracy base calling and methylation calling are sufficient for distribution and long-term storage. The clarification on the importance of raw data retention would strengthen the introduction.

      Results: Please rephrase "[…]fetch a specific read(s) […]". Results: It should be stated more explicitly that BLOW5 is a compressed data representation and therefore suitable for streaming. Results: "The simple BLOW5 simple file-structure[…]" -> "The simple BLOW5 file structure […]"

      Discussion: "[…] users must upload a single FAST5 tarball for a given datasets" -> "[…] users must upload a single FAST5 tarball for a given dataset" Discussion: While the SLOW5 ecosystem is described in detail, it would be beneficial to discuss whether there are any alternative solutions or technologies that provide a comparative perspective. Discussion: It would be interesting to discuss the possible standardization of the SLOW5 ecosystem. What is the vision? An academically centered open-source ecosystem? A proprietary system? A more "formal" standard (GA4GH, ISO/IEC)?

    1. Dynamic functional connectivity (dFC) has become an important measure for understanding brain function and as a potential biomarker. However, various methodologies have been developed for assessing dFC, and it is unclear how the choice of method affects the results. In this work, we aimed to study the results variability of commonly-used dFC methods. We implemented seven dFC assessment methods in Python and used them to analyze fMRI data of 395 subjects from the Human Connectome Project. We measured the pairwise similarity of dFC results using several similarity metrics in terms of overall, temporal, spatial, and inter-subject similarity. Our results showed a range of weak to strong similarity between the results of different methods, indicating considerable overall variability. Surprisingly, the observed variability in dFC estimates was comparable to the expected natural variation over time, emphasizing the impact of methodological choices on the results. Our findings revealed three distinct groups of methods with significant inter-group variability, each exhibiting distinct assumptions and advantages. These findings highlight the need for multi-analysis approaches to capture the full range of dFC variation. They also emphasize the importance of distinguishing neural-driven dFC variations from physiological confounds, and developing validation frameworks under a known ground truth. To facilitate such investigations, we provide an open-source Python toolbox that enables multi-analysis dFC assessment. This study sheds light on the impact of dFC assessment analytical flexibility, emphasizing the need for careful method selection and validation, and promoting the use of multi-analysis approaches to enhance reliability and interpretability of dFC studies. Competing Interest Statement: The authors have declared no competing interest.

      Reviewer 2. Nicolas Farrugia

      Comments to Author: Summary of review. This paper fills a very important gap in the literature investigating time-varying functional connectivity (or dynamic functional connectivity, dFC), by measuring the analytical flexibility of seven different dFC methods. An impressive amount of work has gone into generating a set of convincing results, which essentially show that the main object of interest of dFC, the temporal variability of connectivity, cannot be measured with high consistency, as this variability is of the same order of magnitude as, or even higher than, the changes observed across different methods on the same data. In this very controversial field, it is remarkable that the authors have managed to put together a set of analyses that demonstrates this in a very clear and transparent way. The paper is very well written, the overall approach is based on a few assumptions that make it possible to compare methods (e.g. subsampling of temporal aspects of some methods, spatial subsampling), and the provided analysis is very complete. The most important results are condensed in a few figures in the main manuscript, which is enough to convey the main messages. The supplementary materials provide an exhaustive set of additional results, which are briefly discussed one by one. Most importantly, the authors have provided an open-source implementation of 7 main dFC methods. This is very welcome for the community and for reproducibility, and is of course particularly suited for this kind of contribution. A few suggestions follow.

      Clarification questions and suggestions:

      1- How was the uniform downsampling of 286 ROIs to 96 done? Uniform in which sense? According to the RSN? Were ROIs regrouped with spatial contiguity? I understand this was done in order to reduce computational complexity and to harmonize across methods, but the manuscript would benefit from an added sentence explaining what was done.

      2- Table A in figure 1 shows the important hyperparameters (HP) for each method, but the motivation for the choice of HP for each method is only explained in the discussion (end of page 11, "we adopted the hyperparameter values recommended by the original paper or consensus among the community for each method"). It would be better to explain it in the methods, and then only discuss in the discussion why this can be a limitation.

      3- The github repository https://github.com/neurodatascience/dFC/tree/main does not reference the paper.

      4- The github repository https://github.com/neurodatascience/dFC/tree/main is not documented enough. There are two very large added values in this repo: open implementation of methods, and analytical flexibility tools. The demo notebook shows how to use the analytical flexibility tools, but the methods implementation is not documented. I expect that many people will want to perform analysis using the methods as well as comparison analysis, so the documentation of individual methods should not be minimized.

      5- For the reader, it would be better to mention the availability of the code for reproducibility early in the manuscript (in the introduction). Currently, the toolbox is only introduced in the final paragraph of the discussion. It comes as a very nice surprise when reading the manuscript in full, but I think the manuscript would gain a lot of value if this paragraph was included earlier, and if the development of the toolbox was mentioned much earlier (i.e. in the abstract).

      6- We have published two papers on dFC that the authors may want to include, although these papers have investigated cerebello-cerebral dFC using whole brain + cerebellum parcellations. The first paper used continuous HMM on healthy subjects, and found correlations with impulsivity scores, while the second paper used network measures on sliding window dFC matrices on a clinical cohort (patients with alcohol use disorder). I am not sure why the authors have not found our papers in their literature search, but maybe it would be good to include them. Authors need to update the final table in supplementary materials as well as the citations in the main paper.

      Abdallah, M., Farrugia, N., Chirokoff, V., & Chanraud, S. (2020). Static and dynamic aspects of cerebro-cerebellar functional connectivity are associated with self-reported measures of impulsivity: A resting-state fMRI study. Network Neuroscience, 4(3), 891-909.

      Abdallah, M., Zahr, N. M., Saranathan, M., Honnorat, N., Farrugia, N., Pfefferbaum, A., Sullivan, E. & Chanraud, S. (2021). Altered cerebro-cerebellar dynamic functional connectivity in alcohol use disorder: a resting-state fMRI study. The Cerebellum, 20, 823-835.

      Note that in Abdallah et al. (2020), while we did not compare HMM results with other dFC methods, we did investigate the influence of HMM hyperparameters, as well as perform internal cross validation on our sample + null models of dFC.

      Minor comments 6 - "[..] what lies behind the of methods. Instead, they reveal three groups of methods, 720 variations in dynamic functional connectivity?. " -> an extra "." was added (end of page 10).

    2. Abstract Dynamic functional connectivity (dFC) has become an important measure for understanding brain function and as a potential biomarker. However, various methodologies have been developed for assessing dFC, and it is unclear how the choice of method affects the results. In this work, we aimed to study the results variability of commonly-used dFC methods. We implemented seven dFC assessment methods in Python and used them to analyze fMRI data of 395 subjects from the Human Connectome Project. We measured the pairwise similarity of dFC results using several similarity metrics in terms of overall, temporal, spatial, and inter-subject similarity. Our results showed a range of weak to strong similarity between the results of different methods, indicating considerable overall variability. Surprisingly, the observed variability in dFC estimates was comparable to the expected natural variation over time, emphasizing the impact of methodological choices on the results. Our findings revealed three distinct groups of methods with significant inter-group variability, each exhibiting distinct assumptions and advantages. These findings highlight the need for multi-analysis approaches to capture the full range of dFC variation. They also emphasize the importance of distinguishing neural-driven dFC variations from physiological confounds, and developing validation frameworks under a known ground truth. To facilitate such investigations, we provide an open-source Python toolbox that enables multi-analysis dFC assessment. This study sheds light on the impact of dFC assessment analytical flexibility, emphasizing the need for careful method selection and validation, and promoting the use of multi-analysis approaches to enhance reliability and interpretability of dFC studies.

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae009), and has published the reviews under the same license. These are as follows.

      Reviewer 1: Yara Jo Toenders

      Comments to Author: The authors performed an in-depth comparison of 7 dynamic functional connectivity methods. The paper includes many figures that are greatly appreciated as they clearly demonstrate the findings. Moreover, the authors developed a Python toolbox to implement these 7 methods. The results were highly variable across methods, although three clusters of similar methods could be detected. However, after reading the manuscript, there are some remaining questions.

      - The TR and timepoints of the fMRI images are shown, but other acquisition parameters such as the voxel size are missing. Could all acquisition parameters please be provided?

      - Could more information be provided on the downsampling of the 286 to 96 ROIs? How was this done and what were the 96 ROIs that were created?

      - In the results it is explained that the definition of groups depended on the cutoff value of the clustering; however, it is unclear how the cutoff value was determined. Could the authors explain how this was done?

      - The difference between the subplots in Figure 3 is a bit difficult to understand because the labels of the different methods switch places. Perhaps the same colour could be used for the cluster of the continuous HMM, Clustering and Discrete HMM methods to increase readability?

      - Figure 4b shows that the default mode network is more variable over methods than time, while the auditory and visual networks are not. Could the authors explain what may underlie this discrepancy?

      - From the introduction it became clear that many studies have used dFC to study clinical populations. While I understand that no single recommendation can be given, not every clinical study might have the capacity to use all 7 methods. What would the authors recommend to these clinical studies? Would there, for example, be a method that would be recommended within each of the three clusters?

      - It could be helpful if the authors create DOIs for their toolbox code bases that could be cited in a manuscript, rather than linking to bare GitHub URLs. One potentially useful guide is: https://guides.github.com/activities/citable-code/

    1. Background: Culture-free real-time sequencing of clinical metagenomic samples promises both rapid pathogen detection and antimicrobial resistance profiling. However, this approach introduces the risk of patient DNA leakage. To mitigate this risk, we need near-comprehensive removal of human DNA sequence at the point of sequencing, typically involving use of resource-constrained devices. Existing benchmarks have largely focused on use of standardised databases and largely ignored the computational requirements of depletion pipelines as well as the impact of human genome diversity. Results: We benchmarked host removal pipelines on simulated Illumina and Nanopore metagenomic samples. We found that construction of a custom kraken database containing diverse human genomes results in the best balance of accuracy and computational resource usage. In addition, we benchmarked pipelines using kraken and minimap2 for taxonomic classification of Mycobacterium reads using standard and custom databases. With a database representative of the Mycobacterium genus, both tools obtained near-perfect precision and recall for classification of Mycobacterium tuberculosis. Computational efficiency of these custom databases was again superior to most standard approaches, allowing them to be executed on a laptop device. Conclusions: Nanopore sequencing and a custom kraken human database with a diversity of genomes leads to superior host read removal from simulated metagenomic samples while being executable on a laptop. In addition, constructing a taxon-specific database provides excellent taxonomic read assignment while keeping runtime and memory low. We make all customised databases and pipelines freely available. Competing Interest Statement: The authors have declared no competing interest.

      Reviewer 2. Darrin Lemmer, M.S.

      Comments to Author: This paper describes a method for improving the accuracy and efficiency of extracting a pathogen of interest (M. tuberculosis in this instance, though the methods should work equally well for other pathogens) from a "clinical" metagenomic sample. The paper is well written and provides links to all source code and datasets used, which were well organized and easy to understand. The premise, that using a pangenome database improves classification, seems pretty intuitive, but it is nice to see some benchmarking to prove it. For clarity I will arrange my comments by the three major steps of your methods: dataset generation, human read removal, and Mycobacterium read classification.

      1. Dataset generation: I appreciate that you used a real-world study (reference #8) to approximate the proportions of organisms in your sample; however, I am disappointed that you generated exactly one dataset for benchmarking. Even if you use the exact same community composition, there is a level of randomness involved in generating sequencing reads, and therefore some variance. I would expect to see multiple generations and an averaging of the results in the tables. With a sufficiently high read depth, however, the variance likely won't change your results much, so it would be nice, and more true to real sequencing data, to vary the number of reads generated (I didn't see where you specified the read depth to which you generated reads for each species), as it is rare in the real world to always get this deep of coverage. Ideally it would also be nice to see datasets varying the proportions of MTBC in the sample to test the limits of detection, but that may be beyond the scope of this particular paper.

      2. Human read removal: The data provided do not really support the conclusion, as all methods benchmarked performed quite well and, particularly when using the long reads from the Nanopore simulated dataset, were fairly indistinguishable with the exception of HRRT. The short Illumina reads show a little more separation between the methods, probably due to the shorter sequences being able to align to multiple sequences in the reference databases. However, comparing kraken human to kraken HPRC still shows very little difference, thus not supporting the conclusion that the pangenome reference provides "superior" host removal. The run times and memory used do much more to separate the performance of the various methods, particularly given the goal of being able to run the analysis on a personal computer, where peak memory usage is important. The only methods that perform well within the memory constraints of a personal computer for both long reads and short reads are HRRT and the two kraken methods, with kraken being superior at recall, but again, kraken human and kraken HPRC are virtually indistinguishable, making it hard to justify the claim that the pangenome is superior. Also, it appears your run time and peak memory usage are again based on a single data point; these should be measured multiple times and averaged. Finally, as an aside, I did find it interesting and disturbing that HRRT had such a high false negative rate compared to the other methods, given that this is the primary method used by NCBI for publishing in the SRA database, implying there are quite a few human reads remaining in SRA.

      3. Mycobacterium read classification: Here we do have some pretty good support for using a pangenome reference database, particularly compared to the kraken standard databases, though as mentioned previously, a single datapoint isn't really adequate, and I'd like to see both multiple datasets and multiple runs of each method. Additionally, given that the purpose here is to improve the amount of MTB extracted from a metagenomic sample, these data should be taken one step further to show the coverage breadth and depth of the MTB genome provided by the reads classified as MTB, as a high number of reads doesn't mean much if they are all stacked at the same region of the genome. Given that these are simulated reads, which tend to have pretty even genome coverage, this may not show much; however, it is still an important piece to show the value of your recommended method.

      One final comment is that it should be fairly easy to take this beyond a theoretical exercise, by running some actual real-world datasets through the methods you are recommending to see how well they perform in actuality. For instance, reference #8, which you used as a basis for the composition of your simulated metagenomic sample, published their actual sequenced sputum samples. It would be easy to show whether you can improve the amount of Mycobacterium extracted from their samples over the methods they used, thus showing value to those lower income/high TB burden regions where whole metagenome sequencing may be the best option they have.

      Re-review.

      This is a significantly stronger paper than originally submitted. I especially appreciate that multiple runs have now been done with more than one dataset, including a "real" dataset, and the analysis showing the breadth and depth of coverage of the retained Mtb reads, proving that you can still generally get a complete genome of a metagenomic sample with these methods. However kraken's low sensitivity when using the standard database definitely impacts the results, making a stronger argument for using a pangenome database (Kraken-Standard can identify the presence of Mtb, but if you want to do anything more with it, like AMR detection, you would need to use a pangenome database). I really think that this should be emphasized more, and perhaps some or all of the data in tables S9-S12 be brought into the main paper. It is maybe worth noting, that the significant drop in breadth, I would imagine, is a result of dividing the total size of the aligned reads by the size of the genome, implying a shallow coverage, but the reality is still high coverage in the areas that are covered, but lots of complete gaps in coverage. I did also like the switch to the somewhat more standard sensitivity/specificity metrics, though I do lament the actual FN/FP counts being relegated to the supplemental tables, as I thought these numbers valuable (or at least interesting) when comparing the results of the various pipelines, particularly with human read removal, where the various pipelines perform quite similarly.
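
      The distinction drawn above between breadth and depth of coverage can be illustrated with a small sketch. It assumes alignments are given as half-open intervals on a single linear genome; the function names and the toy numbers are illustrative and are not taken from the authors' pipeline.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) alignment on the genome


def mean_depth(alignments: List[Interval], genome_size: int) -> float:
    """Total aligned bases divided by genome size."""
    return sum(end - start for start, end in alignments) / genome_size


def breadth(alignments: List[Interval], genome_size: int) -> float:
    """Fraction of genome positions covered by at least one alignment."""
    covered = 0
    current_end = 0
    for start, end in sorted(alignments):
        # Count only the part of this interval not already covered.
        start = max(start, current_end)
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_size


# Toy example: 20 reads stacked on the first 100 bases of a 1,000-base genome.
stacked = [(0, 100)] * 20
depth_value = mean_depth(stacked, 1000)
breadth_value = breadth(stacked, 1000)
```

      In this toy case the genome-wide mean depth is 2.0 while only 10% of positions are covered at all, so a single depth figure can hide large gaps in coverage.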

    2. Background: Culture-free real-time sequencing of clinical metagenomic samples promises both rapid pathogen detection and antimicrobial resistance profiling. However, this approach introduces the risk of patient DNA leakage. To mitigate this risk, we need near-comprehensive removal of human DNA sequence at the point of sequencing, typically involving use of resource-constrained devices. Existing benchmarks have largely focused on use of standardised databases and largely ignored the computational requirements of depletion pipelines as well as the impact of human genome diversity. Results: We benchmarked host removal pipelines on simulated Illumina and Nanopore metagenomic samples. We found that construction of a custom kraken database containing diverse human genomes results in the best balance of accuracy and computational resource usage. In addition, we benchmarked pipelines using kraken and minimap2 for taxonomic classification of Mycobacterium reads using standard and custom databases. With a database representative of the Mycobacterium genus, both tools obtained near-perfect precision and recall for classification of Mycobacterium tuberculosis. Computational efficiency of these custom databases was again superior to most standard approaches, allowing them to be executed on a laptop device. Conclusions: Nanopore sequencing and a custom kraken human database with a diversity of genomes leads to superior host read removal from simulated metagenomic samples while being executable on a laptop. In addition, constructing a taxon-specific database provides excellent taxonomic read assignment while keeping runtime and memory low. We make all customised databases and pipelines freely available.

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae010), and has published the reviews under the same license. These are as follows.

      Reviewer 1: Ben Langmead

      Reviewer Comments to Author: Mycobacterium tuberculosis is a leading cause of death and is increasingly investigated using DNA sequencing of e.g. sputum. However, sequencing data has to be handled carefully to remove reads from the human host and to correctly classify the M. tuberculosis reads. The authors focus on this problem of computationally extracting only the M. tuberculosis reads from a simulated human sample, measuring the accuracy and time/memory resources used by different approaches. They begin by focusing on removing human reads using different alignment and k-mer-based tools. The authors discover, interestingly, that using a Kraken database over all the HPRC genomes leads to the best balance of resources and accuracy. Next, the authors focus on classifying the remaining reads with various databases, some general across the tree-of-life and some limited to M. tuberculosis. For this task, the authors identify the custom Mycobacterium databases as being the best choice to correctly identify tuberculosis efficiently. The paper is very clear and well written.

      Major Comments:

      1. While the host-depletion and Mycobacterium classification experiments do tell us something, some of the numbers are quite small, leading me to wonder how robust the results are. The question lingers: should we really be making decisions based on a simulation study where results are similar out to the fifth decimal point? There is definitely information here, but additionally evaluating real datasets or still-larger simulated ones could make the results more actionable.

      2. In the Mycobacterium experiment, it does not seem appropriate to use the reads not classified by Kraken as the inputs, since Kraken is what is being benchmarked in the first place. Given that this is simulated data, an alternative would be to use true non-human reads as the input.

      3. The Discussion could be improved with some discussion of whether these approaches could generalize to other taxa, or other host/pathogen combinations.

      4. In Table 3, the authors point out that minimap2 is the only tool to misclassify Mycobacterium reads as Human. Relatively, it has the most FPs compared to the other tools. Did the authors investigate where those alignments fell within CHM13v2? It's mentioned that Hostile uses minimap2 but with extra filtering, so I'm surprised it only has 4 FPs. Does that make sense given the specific filtering steps it performs after alignment?

      5. I may have missed it, but the authors should characterize the error rate for the simulated ONT and Illumina reads somewhere. Saying the "default model" is used doesn't help the reader to understand the error profile.

      Minor Comments:

      1. Kraken has substantially lower recall using the standard and standard-8 databases in Tables 4 and 5. Those were the only times a tool had a recall below 97%. Is this expected? Perhaps because key strains are missing from the database? This wasn't explained.

      2. The units of the peak memory in the tables are in MB, but memory thresholds are described using GB units in the paper. Consider changing tables to be in GB.

      3. At the end of the host-depletion section, it's mentioned that all missed human reads (FNs) were from unplaced scaffolds. Is it known if those matches are due to contamination in the assembly? Contigs under 10 kb were filtered, but there could be contaminated contigs above that length.

      Re-review: The authors have substantially addressed the points I raised in my review.

    1. Abstract Biomedical research frequently uses murine models to study disease mechanisms. However, the translation of these findings to human disease remains a significant challenge. In order to improve the comparability of mouse and human data, we present a cross-species integration pipeline for single-cell transcriptomic assays. The pipeline merges expression matrices and assigns clear orthologous relationships. Starting from Ensembl ortholog assignments, we allocated 82% of mouse genes to unique orthologs by using additional publicly available resources such as Uniprot and NCBI databases. For genes with multiple matches, we employed the Needleman-Wunsch global alignment based on either amino acid or nucleotide sequence to identify the ortholog with the highest degree of similarity. The workflow was tested for its functionality and efficiency by integrating scRNA-seq datasets from heart failure patients with the corresponding mouse model. We were able to assign unique human orthologs to up to 80% of the mouse genes, utilizing the known 17,492 orthologous pairs. Curiously, the integration process enabled the identification of both common and unique regulatory pathways between species in heart failure. In conclusion, our pipeline streamlines the integration process, enhances gene nomenclature alignment and simplifies the translation of mouse models to human disease. We have made the OrthoIntegrate R-package accessible on GitHub (https://github.com/MarianoRuzJurado/OrthoIntegrate), which includes the assignment of ortholog definitions for human and mouse, as well as the pipeline for integrating single cells. Keypoints: Novel integration workflow for scRNA-seq data from different species in an easy to use R-package (“OrthoIntegrate”). Improved one-to-one ortholog assignment via sequence similarity scores and string similarity calculations. Validation of “OrthoIntegrate” results with a case study of snRNA-seq from human heart failure with reduced ejection fraction and its corresponding mouse model.
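
      The tie-breaking step mentioned in the abstract above (choosing, among several candidate orthologs, the one with the highest global alignment similarity) can be sketched as follows. This is an illustration in Python rather than the R package's code, and the scoring scheme (match +1, mismatch -1, linear gap -1) is an assumption; the abstract does not state which parameters OrthoIntegrate uses. The names `nw_score` and `best_ortholog` and the toy sequences are hypothetical.

```python
from typing import Dict


def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    # prev holds the previous DP row: best scores for a[:i-1] against each b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_ortholog(query: str, candidates: Dict[str, str]) -> str:
    """Return the candidate name whose sequence scores highest against the query."""
    return max(candidates, key=lambda name: nw_score(query, candidates[name]))


# Toy sequences, not real genes.
candidates = {"cand_1": "GATTACA", "cand_2": "GCATGCU"}
chosen = best_ortholog("GATTACA", candidates)
```

      Only the final score is kept here, so two rows of the dynamic-programming table are enough; recovering the alignment itself would require the full table and a traceback.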

      Reviewer 2: Yinqi Bai

      Comments to Author: Jurado et al. reported a pipeline designed to optimize the detection of orthologous genes and used it to enhance the integration of cross-species single-cell RNA sequencing (scRNA-seq) data. They demonstrated the effectiveness of this pipeline by comparing shared and distinct regulatory pathways between human HFrEF (Heart Failure with Reduced Ejection Fraction) patients and the corresponding mouse model. The work provided reliable results that emphasize the importance of exercising caution when using mouse models to study disease mechanisms. However, several important factors should be critically considered and benchmarked. Here are a few major issues:

      1. Ortholog identification has long been a critical and essential step for many comparative, evolutionary, and functional genomic analyses. To evaluate the performance of an orthology inference method, there are some gold standards available for benchmark testing, such as the Quest for Orthologs benchmark service (https://orthology.benchmarkservice.org). Whether OrthoIntegrate outperforms other methods should be comprehensively benchmarked on diverse datasets and metrics, rather than relying solely on the silhouette coefficient score from a heart single-cell RNA sequencing (scRNA-seq) dataset.

      2. According to the authors' integration pipeline, both human and mouse scRNA-seq data are individually clustered to assign cell type labels and are then further integrated with orthologous genes for clustering to assign new labels. How do the labels for each cell and each cell type change before and after the integration approach? Does cell type assignment become more reasonable after the integration? The authors should demonstrate that the selection of orthologous genes for clustering improves the accuracy of cell type assignment. The silhouette coefficient score is not a direct metric for assessing accuracy, as it can be influenced by biological factors. For example, in Supplementary Table 3, the silhouette scores of mouse-HFrEF samples generated by Paranoid and OMA are consistently higher than those by OrthoIntegrate, which is the opposite of what is seen in the control groups and human-HFrEF samples.

      3. The data analysis needs to be expanded further if there are findings with potential biological significance. For example, the authors mentioned, 'In cluster 25, we observe a group of genes showing increased expression in human FBs, and we also identify a set of genes that are negatively regulated in cluster 28 in human ECs.' However, there is no functional analysis, such as GO or KEGG pathway enrichment analysis, conducted to interpret the data and validate these findings.

      4. The discussion section is confusing. The authors should clarify whether the paper is primarily focused on research methods or data analysis. If it is a data analysis paper, the authors should conduct additional investigations to include further data analysis. If it is a research method paper, the authors should extend the discussion to relate to the algorithm itself.
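      Editorial illustration of point 2 above: the silhouette coefficient measures only how compact and well separated the clusters are under a given labelling; it is never compared against reference cell type annotations. The sketch below is a minimal pure-Python version, assuming Euclidean distance and toy two-dimensional points (both are assumptions for illustration; the function name `silhouette_scores` is hypothetical and this is not the authors' implementation).

```python
from math import dist
from statistics import mean


def silhouette_scores(points, labels):
    """Return one silhouette value per point.

    For point i: a = mean distance to the other members of its own cluster,
    b = smallest mean distance to the members of any other cluster,
    s = (b - a) / max(a, b). Points in singleton clusters get 0.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = mean(dist(points[i], points[j]) for j in own)
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores


if __name__ == "__main__":
    toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    # Labelling that follows the geometry versus one that cuts across it.
    print(silhouette_scores(toy_points, ["a", "a", "b", "b"]))
    print(silhouette_scores(toy_points, ["a", "b", "a", "b"]))
```

      Because the score depends only on the geometry of the embedding and the labels supplied, a biologically driven shift in the embedding can raise or lower it without any change in annotation accuracy, which is why the reviewer asks for complementary metrics.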

      Minor comments:

      1. The cell number for each sample and each clustered cell type is critical for assessing the reliability of the results; however, this information is not provided in the paper.

      2. As the mouse model is generated through chronic infarction, it raises the question of why very few T/B cell markers are found in immune cells in Figure 1F. Is it possible that these cell types are not adequately captured in the mouse samples? In data integration analysis, the audience may be more interested in understanding how species-specific cell types perform, particularly when, for instance, macrophages are the only dominant immune cells found in human samples.

      3. On page 5, clarify "latter ones" in the sentence "Most of the latter ones were long non-coding RNAs with identical gene names."

      4. On page 5, correct the reference to Supplementary Figure 4A instead of Supplementary Figure 3A and Supplementary Table 3.

      5. On page 16, replace "regulated genes" with "differentially expressed genes (DEGs)" to accurately represent what the authors are referring to.

      Re-review:

      The authors' additional analysis is commendable. With the inclusion of new evaluation metrics, the benchmark section now appears relatively comprehensive, and the explanations provided for the reduced NMI score are reasonable. In the results section, the supplementary information on functional enrichment further elucidates the biological functions of fibroblast cluster 25 and endothelial cell cluster 28. There are still some minor suggestions for improvement: 1. The presentation of the biological findings in the discussion section could be more succinct to improve clarity. 2. There is a lack of discussion on the impact of the numerous lncRNAs generated by OrthoIntegrate. This topic requires further exploration and elaboration. 3. Reorganize the paragraphs for "Single cell pre-processing" and "Study samples" to clarify the source of the data used in the article. Emphasize the data generated by the authors (E-MTAB-13264) and provide details on the single-cell sequencing process (not only the raw data pre-processing).

    2. Abstract: Biomedical research frequently uses murine models to study disease mechanisms. However, the translation of these findings to human disease remains a significant challenge. In order to improve the comparability of mouse and human data, we present a cross-species integration pipeline for single-cell transcriptomic assays. The pipeline merges expression matrices and assigns clear orthologous relationships. Starting from Ensembl ortholog assignments, we allocated 82% of mouse genes to unique orthologs by using additional publicly available resources such as the UniProt and NCBI databases. For genes with multiple matches, we employed the Needleman-Wunsch global alignment based on either amino acid or nucleotide sequence to identify the ortholog with the highest degree of similarity. The workflow was tested for its functionality and efficiency by integrating scRNA-seq datasets from heart failure patients with the corresponding mouse model. We were able to assign unique human orthologs to up to 80% of the mouse genes, utilizing the known 17,492 orthologous pairs. Curiously, the integration process enabled the identification of both common and unique regulatory pathways between species in heart failure. In conclusion, our pipeline streamlines the integration process, enhances gene nomenclature alignment and simplifies the translation of mouse models to human disease. We have made the OrthoIntegrate R-package accessible on GitHub (https://github.com/MarianoRuzJurado/OrthoIntegrate), which includes the assignment of ortholog definitions for human and mouse, as well as the pipeline for integrating single cells.
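      Editorial illustration: the abstract describes resolving genes with multiple ortholog matches by Needleman-Wunsch global alignment and keeping the most similar candidate. The sketch below shows that idea in plain Python. The scoring values (match +1, mismatch -1, gap -1), the function names `needleman_wunsch_score` and `best_ortholog`, and the toy sequences are all assumptions for illustration; the OrthoIntegrate R-package may use different scoring matrices and gap penalties.

```python
from typing import Mapping


def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Return the optimal global alignment score of sequences a and b."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned against a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned against a gap
            curr[j] = max(diagonal, gap_in_b, gap_in_a)
        prev = curr
    return prev[-1]


def best_ortholog(query: str, candidates: Mapping[str, str]) -> str:
    """Return the name of the candidate with the highest global alignment score."""
    return max(candidates,
               key=lambda name: needleman_wunsch_score(query, candidates[name]))


if __name__ == "__main__":
    # Hypothetical toy sequences standing in for one mouse gene and two
    # candidate human orthologs.
    mouse_seq = "GATTACA"
    human_candidates = {"candidate_a": "GATTACA", "candidate_b": "GCATGCG"}
    print(best_ortholog(mouse_seq, human_candidates))
```

      A global alignment scores the two sequences end to end, so large length differences are penalised; local alignment tools such as BLAST instead reward the best-matching subsequence, which is the contrast raised in the first point of Reviewer 1's comments.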

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae011), and has published the reviews under the same license. These are as follows.

      Reviewer 1: Ruoyan Li

      Comments to Author: In the manuscript entitled 'Improved integration of single cell transcriptome data demonstrates common and unique signatures of heart failure in mice and humans', the authors developed a pipeline (OrthoIntegrate) to assign gene orthologs across species and integrate cross-species single-cell RNA-seq data based on Seurat workflows. The authors further compared OrthoIntegrate to other orthologue databases and tools and highlighted a better performance of their method. To illustrate the potential applications of OrthoIntegrate, the authors integrated single-cell/single-nuclei RNA-seq data from cardiac tissue of heart failure patients with reduced ejection fraction (HFrEF) and a mouse model mimicking HFrEF using the pipeline. This revealed commonly regulated genes in the disease condition between species (i.e., genes related to cardiomyocyte energy metabolism) and species-specifically regulated genes (i.e., angiogenesis-related genes in humans). Overall, this is a well-designed study with the development of a useful cross-species single-cell data integration pipeline whose applications have been showcased in the context of heart failure (to me it is more like an improved orthologue assignment method).

      A few points need to be addressed before publishing:

      1. The authors utilized the Needleman-Wunsch algorithm to generate one-to-one orthologs between human genes and mouse genes. What is the advantage of using this algorithm compared to other algorithms (e.g., BLAST, as used by SAMap)?

      2. The authors have shown the application of OrthoIntegrate in the context of heart failure between mice and humans. Could the authors include at least one more example of using OrthoIntegrate in other disease conditions or between other species to show the versatility of OrthoIntegrate?

      3. To assess the quality of clustering after integration, the authors calculated silhouette coefficients/scores and found that integration by OrthoIntegrate resulted in an improved clustering performance. Could the authors include more benchmarking metrics to assess the performance of OrthoIntegrate compared to other methods? The authors could consider metrics like the species mixing score used by BENGAL (Song et al., 2022, bioRxiv; https://github.com/Functional-Genomics/BENGAL).

      4. Miscalling of figures: silhouette coefficients are shown in Supp_Fig_4 rather than Suppl_Fig_3.

      5. Some information on the datasets used in the manuscript has been shown in supplementary table 1, but it is still a bit confusing, for example, where the mouse and human HFrEF datasets come from. I am not exactly sure, but I presume the HFrEF datasets are from E-MTAB-13264? This information should be described more explicitly in the method section.

  4. Apr 2024
    1. Abstract. Background: Organoids are three-dimensional experimental models that summarize the anatomical and functional structure of an organ. Although a promising experimental model for precision medicine, patient-derived tumor organoids (PDTOs) have currently been developed only for a fraction of tumor types. Results: We have generated the first multi-omic dataset (whole-genome sequencing, WGS, and RNA-sequencing, RNA-seq) of PDTOs from the rare and understudied pulmonary neuroendocrine tumors (n = 12; 6 grade 1, 6 grade 2), and provide data from other rare neuroendocrine neoplasms: small intestine (ileal) neuroendocrine tumors (n = 6; 2 grade 1 and 4 grade 2) and large-cell neuroendocrine carcinoma (n = 5; 1 pancreatic and 4 pulmonary). This dataset includes a matched sample from the parental sample (primary tumor or metastasis) for a majority of samples (21/23) and longitudinal sampling of the PDTOs (1 to 2 time-points), for a total of n = 47 RNA-seq and n = 33 WGS. We here provide quality control for each technique, and provide the raw and processed data as well as all scripts for genomic analyses to ensure an optimal re-use of the data. In addition, we report somatic small variant calls and describe how they were generated, in particular how we used WGS somatic calls to train a random-forest classifier to detect variants in tumor-only RNA-seq. Conclusions: This dataset will be critical to future studies relying on this PDTO biobank, such as drug screens for novel therapies and experiments investigating the mechanisms of carcinogenesis in these understudied diseases.

      A version of this preprint has been published in the Open Access journal GigaScience (see https://doi.org/10.1093/gigascience/giae008 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      Reviewer Qiuyue Yuan

      The authors conducted a study where they generated multi-omics datasets, including whole-genome sequencing and RNA sequencing, for rare neuroendocrine tumors in the lungs, small intestine, and large cells.

      They used patient-derived tumor organoids and performed quality control analysis on the datasets. Additionally, they developed a random forest classifier specifically for detecting mutations in the RNA-seq data.

      The pipeline used in this study is well-organized, but I have a few queries that I would like to clarify before recommending it for publication.

      Major concerns:

      The data processing and quality control procedures would be valuable for other researchers working with similar datasets. It would be beneficial to add these procedures to the GitHub repository (https://github.com/IARCbioinfo/MS_panNEN_organoids). Furthermore, it would be helpful to provide insights into what constitutes good quality reads, such as the number of unique reads and the ratio of duplicate reads.

      Regarding the random forest (RF) model, it is mentioned that there are 10 features. Could you clarify whether these features come from public information, or whether all of them are extracted solely from the RNA-seq data? Also, does the RF model work for WGS data as well?

      Was there any specific design implemented to address the issue of imbalanced positive and negative samples?

      The RNA-seq data are not used to quantify gene expression here, which wastes important information.

      Minor concerns:

      In Figure 6C, what does "Mean minimum depth" refer to?

      Is the most important feature identified by the RF model a good predictor?

      Reviewer Saurabh V Laddha

      Alcala et al. did excellent work on a rare cancer type by creating a molecular fingerprint of PDTOs, which has a direct impact for researchers working on these rare cancers. As a data note, this is an excellent resource that covers a huge gap in this rare cancer field. These PDTOs have high impact, especially for cancers that are slow-growing and not easy to culture in the lab. The authors covered details regarding each technique used in this study, and the figures are clear to understand, with exceptional writing.

      Minor comments:

      - Did the authors compare the PDTOs to tumor molecular datasets? This will be key to understanding how closely and qualitatively the PDTOs are related to the molecular profiles of actual tumor datasets. It is not clear in the current version, and it would help readers decide whether the PDTO molecular fingerprint system is valuable to them. The manuscript is not required to address this, but a note would help readers make an informed decision about using such resources and with what limitations.

      - The authors covered longitudinal samples in this system for 1 to 2 time-points. A description of the molecular changes they observed across the longitudinal time-points would be helpful for readers. Also, based on the authors' experience with longitudinal sampling, do they have key suggestions for researchers? A brief discussion would be helpful.

      - The authors did a comprehensive small variant analysis from WGS and RNA-seq. Did the authors find known somatic variations in these samples, mainly when comparing against the known published mutational landscape? A note on this would be helpful.

      - A comment on the limitations of PDTOs and of the molecular fingerprint created from such PDTOs would be valuable.

      - The authors could briefly comment on using such molecular datasets from PDTOs and combining them with other datasets to improve statistical power to discover informative molecular features of these cancers. This relates to my first point on how similar PDTOs are to the tumor molecular profile.

      Reviewer Masashi Fujita

      In this manuscript, Alcala et al. have reported on the whole genome sequencing (WGS) and RNA sequencing (RNA-seq) of 23 patient-derived tumor organoids of neuroendocrine neoplasms.

      This is a detailed report on the quality control of WGS, RNA-seq, and sample swap. The methods are solid and well-described. The raw sequencing data have been deposited in a public repository. This dataset could be a valuable resource for exploring the biology and treatment of this rare type of tumor.

      Here are my comments to the authors:

      Could you please clarify whether the organoids described in this manuscript will be distributed? If so, could you provide the contact address and any restrictions, such as a material transfer agreement?

      You have deposited the RNA-seq gene expression matrix in the public repository European Genome-phenome Archive (dataset ID: EGAD00001009994). However, the file is under controlled access. This limits the availability of data, especially for scientists who just want a quick glance at the data. Since the gene expression matrix does not contain personally identifiable information, I wonder if you could make the file open access.

      You have reported how you detected somatic mutations in the organoids. However, you did not share the list of detected mutations. Sharing this list would help scientists who do not have a computational background. Open access is preferable in this case, but controlled access is also acceptable because germline variants could be misclassified as somatic.

      The primary site of mLCNEC23 is unknown. Could you infer its primary site based on gene expression patterns or driver mutations?

      I have concerns about the generalizability of your random forest model because it was trained using only 22 somatic mutations. Could you assess your prediction model using publicly available datasets of cancer genomes (e.g., TCGA)?

  5. Mar 2024
    1. Editors Assessment:

      MPDSCOVID-19 has been developed as a one-stop solution for drug discovery research for COVID-19, running on the Molecular Property Diagnostic Suite (MPDS) platform. This is built upon the open-source Galaxy workflow system, integrating many modules and data specific to COVID-19. Data integrated includes SARS-CoV-2 targets, genes and their pathway information; information on repurposed drugs against various targets of SARS-CoV-2, mutational variants, polypharmacology for COVID-19, drug-drug interaction information, Protein-Protein Interaction (PPI), host protein information, epidemiology, and inhibitors databases. After improvements to the technical description of the platform, testing helped demonstrate the potential to drive open-source computational drug discovery with the platform.

      This evaluation refers to version 1 of the preprint

    2. Abstract: Computational drug discovery is intrinsically interdisciplinary and has to deal with multifarious factors that often depend on the type of disease. Molecular Property Diagnostic Suite (MPDS) is a Galaxy-based web portal which was conceived and developed as a disease-specific web portal, originally developed for tuberculosis (MPDSTB). As specific computational tools are often required for a given disease, developing a disease-specific web portal is highly desirable. This paper focuses on the development of the customised web portal for COVID-19 infection, referred to as MPDSCOVID-19. As expected, the MPDS suite of programs has modules which are essentially independent of a given disease, whereas some modules are specific to a particular disease. In the MPDSCOVID-19 portal, there are modules which are specific to COVID-19, and these are grouped in the SARS-CoV-2 disease library. Further, new additions and/or significant improvements were made to the disease-independent modules, besides the addition of tools from the Galaxy ToolShed. This manuscript provides the latest update on the disease-independent modules of MPDS after almost 6 years, and provides the contemporary information and tools necessary to engage in drug discovery research for COVID-19. The disease-independent modules include file format converter and descriptor calculation under the data processing module; QSAR, pharmacophore, scaffold analysis, active site analysis, docking, screening, drug repurposing tool, virtual screening, visualisation, sequence alignment, phylogenetic analysis under the data analysis module; and various machine learning packages, algorithms and an in-house developed machine learning antiviral prediction model. The MPDS suite of programs is expected to bring a paradigm shift in computational drug discovery, especially in the academic community, guided through a transparent and open innovation approach. The MPDSCOVID-19 can be accessed at http://mpds.neist.res.in:8085.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.114), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Prashanth N Suravajhala

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes. The authors could describe the Minimum Information About a Bioinformatics Investigation (MIABI) guidelines.

      Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code? Yes, on GitHub and Zenodo. I tested the git repository and forked it; as I did not test the graphical version, I ensured that all Python libraries are working.

      Is the documentation provided clear and user friendly? Yes, although a white paper could be more friendly.

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level? Yes, with the README version.

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages? Yes, as described by the authors.

      Are there (ideally real world) examples demonstrating use of the software? Yes.

      The Molecular Property Diagnostic Suite (MPDS) is a welcome initiative that would serve the research community exploring chemical space. While the authors aimed to deploy it in Galaxy, there is no Galaxy reference cited in the first few introductory lines. A strong rationale for the Galaxy-MPDS connection could be a value addition. Ports 8085/8080 are ephemeral, and it would be nice if the authors deployed the portal on a more permanent base. An absolute strength of the suite is the availability of source code, so that end users can fine-tune and reinstantiate the code. In the architecture, could the end user have a chance to deploy Biopython modules for drug discovery/modeling?

      On page 4, the authors could define precisely which tools are used in MPDS. Section 2.3: PPI is not expanded at first use. The results are exploited well for disease-dependent/independent use. However, the major challenge for ligand use/preparation is the use of ncRNAs. Could MPDS provide instances where ncRNAs could be used for targeted ligands? L28 in section 4.1: use the plural for "features" (as "one of" is used). Also, a word or two on the aadhar card may be added for naive users, and the term may be italicized as it is a domestic word. Does the MPDS suite augur well for Anvaya, which the Government of India launched, or Tavexa or Taverna? A word or two on setting up a local cloud instance may be a nice addition.

      Scores on a scale of 0-5 with 5 being the best

      Language: 4 Novelty: 4.5 Brevity: 4 Scope and relevance: 4

      Language/Brevity checks:

      Page 9 L6: "fulfill" is misspelt; "webserver" is two words, IMHO.

      Page 10: CADD which IS available

      Table S2/S4: from THE coronaviridae; add a space in "anticoronavirusdrugs"

      Figure S3: remove OF (identifying OF existing). Supporting information may be corrected. High-resolution figures, especially the GA and Figures 2-4, may be inserted.

      Reviewer 2. Abdul Majeed

      Is the language of sufficient quality? Yes. Some changes are needed to make the writing more scientific.

      Is the code executable? Unable to test

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Unable to test

      Additional comments: In this paper, the authors introduced a Molecular Property Diagnostic Suite (MPDS), which is a Galaxy-based web portal that was conceived and developed as an open-source disease-specific web portal. MPDS is a customized web portal developed for COVID-19, which is a one-stop solution for drug discovery research. I read the article; it is well-written and well-presented. The enclosed contents can be very useful for researchers working in this field (e.g., COVID-19 systems development). However, I have some comments/concerns about the current version that need correction during the revision.

      1. In the abstract, please provide a technical description of how the method works. Also, please mention the entities which can benefit from the system.

      2. The introduction section doesn't present the challenges/problems of the existing tools. Please discuss the challenges of previous tools of this kind and how they are addressed by this new system.

      3. I could not find concrete details of the data modalities supported in the system. The authors are advised to include such details.

      4. The authors mentioned the use of ML, but I couldn't find any potential usage of ML models. Please add such analysis during the revision.

      5. Also, please add some performance results like time complexity, storage, I/O cost, etc.

      6. One comprehensive diagram should be included to better illustrate the working of the proposed tool.

      7. Please add limitations of the proposed tool in the revised work.

      8. Please add the potential implications of this tool in the context of current/future pandemics.

      Re-review: I have carefully checked the revised work and the authors' responses. The authors have made the desired modifications, and I have no major concerns about this paper. However, Comment 3 from the previous review round has not been properly addressed by the authors. By data modality, I meant tabular data, graph data, audio data, video data, etc. The authors should clearly describe in the paper each data modality processed in their system. In Figure 4, some contents (e.g., protein information, PPI interaction, etc.) are unreadable. The abbreviations are not consistently written in terms of lowercase and capital letters. In the paper, the authors are advised to clearly describe the purpose of this tool, who will benefit and in what capacity, why these kinds of tools are needed, etc. I suggest adding such information to the abstract to clearly convey the message to readers. In the title, please recheck one term: Open Access or Open Source. Journals are open access, while software is usually open source.

      Reviewer 3. Agastya P Bhati

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes. As noted in my comments, it would be beneficial to clarify what new capabilities are provided by this new portal over and above what is already available currently.

      Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code? No. There is a GitHub repository (https://github.com/gnsastry/MPDS-18Compound_Library), however, I am unable to access it currently.

      As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code? Yes. A GitHub repository provides such capabilities. However, it is inaccessible currently.

      Is the code executable? Unable to test

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Unable to test

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages? Is automated testing used or are there manual steps described so that the functionality of the software can be verified? No

      Additional comments: Molecular Property Diagnostic Suite for COVID-19 (MPDSCOVID19) is an open-source disease-specific web portal aiming to provide, at a single portal, a collection of all tools and databases relevant for COVID-19 that are available online, along with a few in-house scripts. It is built upon another platform called "Galaxy" that provides similar services for data-intensive biomedical research. MPDSCOVID19 continues two other similar disease-specific portals that this group has published earlier, for tuberculosis and diabetes mellitus. Overall, MPDSCOVID19 is an interesting and useful resource that could be helpful for the biomedical community in conducting COVID-19 related research. It brings together all the databases and relevant tools that may make a researcher's life easier, as exemplified through the various case studies included.

      I recommend publishing this article after the following revisions noted. Please note that any mention of page numbers below is referring to the reviewer PDF version.

      Major revisions:

      (1) One main issue in this manuscript is the lack of a clear description of the "new" capabilities provided by MPDSCOVID19 over and above what Galaxy provides. I think a clear distinction between the capabilities/features of Galaxy and MPDSCOVID19 would help improve the manuscript substantially and help readers better understand the capabilities of this new COVID-19 portal.

      Further, a description of the additions in the new portal over the earlier TB and Diabetes portals is mentioned on page 7. However, I think more details on such advancements/additions would be beneficial. It could be in the form of a table.

      (2) It is mentioned that a major advancement in this new portal is the inclusion of ML/AI models/approaches, however no details have been provided. It would be beneficial to briefly describe what ML based capabilities are included in MPDS and how they can be used by any general user. An additional case study demonstrating the same would be helpful.

      (3) MPDS portal provides a collection of tools and databases for COVID-19. However, such resources are ever-growing and hence constant updating of the portal's capabilities/resources would be a necessary requirement for its sustainability. There is no mention of any such plans. Do authors have a modus operandi for the same? Have there been further releases of the previous MPDS portals for TB and Diabetes that may be relevant here?

      (4) Page 6 - lines 3-4: I suggest replacing "are going to witness" with "are witnessing". There are several recent advancements in applying ML/AI based approaches to improve different aspects of drug discovery. I recommend including a few references here to this effect. Below are some relevant examples:

      (a) 10.1021/acs.jcim.0c00915 (b) 10.1021/acs.jcim.1c00851 (c) 10.1038/s41598-023-28785-9 (d) 10.1098/rsfs.2021.0018 (e) 10.1145/3472456.3473524 (f) 10.1145/3468267.3470573

      (5) Page 7 - line 8: I am assuming that the terms like "updates", "additions", etc., used in this paragraph are comparing MPDS with its older versions. If so, it would be beneficial to clarify this explicitly. In addition, I suggest that the authors include a brief literature survey to describe what other tools and/or webservers are available already and how MPDS compares with them. This has not been done so far.

      (6) The github repository is currently inaccessible publicly. This needs rectification.

      Minor revisions:

      (1) Page 4: Before introducing MPDSCOVID19 it makes sense to briefly describe Galaxy and its main features. For instance moving forward lines 19-20 (page 4) and lines 3-6 (page 5) to line 12 (page 4).

      (2) Page 5 - line 22: I suggest that authors mention the total number of databases/servers that are covered by MPDSCOVID19 as of now. From Table S1, it appears that there are 15 currently (items 5 and 7 are repeated so the 13 seems the wrong total - needs rectification).

      (3) Page 5 - line 30: It would make sense to specify details of the MPDS local server. For instance, how many cores/GPUs are available and what are their hardware architectures? Also, it would be beneficial for the users to know if it is possible to use MPDS tools on their own or public infrastructures for large scale implementations. I suggest authors comment on this aspect too.

      (4) Page 6 - lines 16-19: The sentence "Galaxy platform.......extend the availability." needs some rephrasing. It is too long and hard to comprehend.

      (5) Page 7 - line 18: I don't understand the word "colloids". Please clarify.

      (6) Page 8 - line 30: "section 2.3" is referred to but I don't see any section numbering in the PDF provided. This needs rectification.

      Re-review: I am satisfied with the changes made to the manuscript and recommend publishing it in its current form if all other reviewers are happy with that.

    1. Editors Assessment:

      This Data Release paper presents an updated genome assembly of the doubled haploid perennial ryegrass (Lolium perenne L.) genotype Kyuss (Kyuss v2.0). To correct for structural errors, the authors de novo assembled the genome again with ONT long reads and generated 50-fold coverage high-throughput chromosome conformation capture (Hi-C) data to assist pseudo-chromosome construction. After being asked for further improvements to the gene and repeat annotation, the authors now demonstrate that the new assembly is more contiguous, more complete, and more accurate than Kyuss v1.0 and shows the correct pseudo-chromosome structure. These more accurate data have great potential for downstream genomic applications, such as read mapping, variant calling, genome-wide association studies, comparative genomics, and evolutionary biology, and these future analyses should benefit forage and turf grass research and breeding.

      This evaluation refers to version 1 of the preprint

    2. ABSTRACT: This work is an update and extension of the previously published article “Ultralong Oxford Nanopore Reads Enable the Development of a Reference-Grade Perennial Ryegrass Genome Assembly”, by Frei et al. The published genome assembly of the doubled haploid perennial ryegrass (Lolium perenne L.) genotype Kyuss marked a milestone for forage grass research and breeding. However, order and orientation errors may exist in the pseudo-chromosomes of Kyuss, since barley (Hordeum vulgare L.), which diverged 30 million years ago from perennial ryegrass, was used as the reference to scaffold Kyuss. To correct for structural errors possibly present in the published Kyuss assembly, we de novo assembled the genome again and generated 50-fold coverage high-throughput chromosome conformation capture (Hi-C) data to assist pseudo-chromosome construction. The resulting new chromosome-level assembly showed improved quality with high contiguity (contig N50 = 120 Mb), high completeness (total BUSCO score = 99%), high base-level accuracy (QV = 50) and correct pseudo-chromosome structure (validated by Hi-C contact map). This new assembly will serve as a better reference genome for Lolium spp. and greatly benefit the forage and turf grass research community.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.112), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Qing Liu

      This updated assembly of doubled haploid perennial ryegrass (Lolium perenne L.), built with 50-fold coverage Hi-C data, shows a contig N50 of 120 Mb and a total BUSCO score of 99%, which supports the improved assembly serving as a reference for Lolium species. The article is well edited except for the revision points below. A minor revision is suggested for the current version.

      (1) Please clarify whether Kyuss v2.0 uses the same reference as Kyuss v1.0 or a separate one.

      (2) In Table 3 on page 6, what is the number of repeat elements in each family? Could the authors list both number and proportion to make the family categories clear? For example, is the number of rolling-circle elements the same as that of Helitrons?

      (3) Could the authors provide tandem repeat, satellite, or centromere location data for the updated assembly of the Lolium species?

      (4) For Figure 1, what are the heterozygosity and the k-mer-estimated genome size? I can't find these data.

      (5) In Figure 3A, I suggest substituting lowercase letters a, b, c, d and e for A, B, C, D and E, to avoid confusion between Figure 3A and Figure 3AA.

      Reviewer 2. Istvan Nagy

      Are all data available and do they match the descriptions in the paper? No. Minor revision in the manuscript body is suggested. Gene annotation and repeat annotation data need some minor revision. See details in the "Additional Comments" section. Additional Comments: The submitted dataset reports an improved chromosome-level assembly and annotation of the doubled-haploid line Kyuss of Lolium perenne. The present v2.0 assembly shows significant improvements compared to the Kyuss v1.0 assembly published by the same group in 2021: The new assembly incorporates 99% of the estimated genome size in seven pseudo-chromosomes and the >99% BUSCO completeness of the gene space is also impressive.

      Below are my remarks and suggestions on the present version of the manuscript:

      Genome assembly and polishing: It's indicated that for the primary assembly of the present work the same source of ONT reads was used as for the previous Kyuss v1.0 assembly. However, in the present manuscript the authors report clearly better assembly quality than for the Kyuss v1.0 assembly. The question remains open whether the authors achieved better results by changing/optimizing the primary assembly parameters, and/or by applying a step-wise, iterative strategy with repeated rounds of long-read and short-read corrections. In any case, a more detailed description/specification of assembly parameters would be desirable.

      Genome annotation: In the provided annotation file "kyuss_v2.gff", most gene IDs consist of the reference chromosome ID and a running number, like "KYUSg_chr1.188". However, in a few cases gene IDs like "KYUSt_contig_1275.207" are also used. This inconsistency might confuse future users of Kyuss_2 resources, and while the latter type of gene ID might be useful for internal usage, it has become meaningless, as pseudo-chromosomes (and some unplaced scaffolds) are now used as references instead of contigs. The authors should modify the gff files and use a consistent naming scheme for all genes. Further, transcript DNA sequences as well as transcript protein sequences with consistent naming schemes should also be provided.
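
      The consistency check requested above can be sketched in a few lines. This is a minimal illustration only: the pattern `KYUSg_chr<N>.<N>` is inferred from the single example ID quoted in the review, and the function name and toy ID list are hypothetical. Genes on unplaced scaffolds would need their own agreed pattern.

```python
import re
from typing import Iterable, List

# Assumed naming scheme, inferred from the example "KYUSg_chr1.188".
CHROMOSOME_GENE_ID = re.compile(r"^KYUSg_chr\d+\.\d+$")

def inconsistent_ids(gene_ids: Iterable[str]) -> List[str]:
    """Return the gene IDs that do not follow the assumed chromosome-based scheme."""
    return [gid for gid in gene_ids if not CHROMOSOME_GENE_ID.match(gid)]

# Toy input using the two IDs quoted in the review.
example_ids = ["KYUSg_chr1.188", "KYUSt_contig_1275.207"]
print(inconsistent_ids(example_ids))
```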

      Repeat annotation: The authors should modify Table 3 by specifying and breaking down repeat categories according to the Unified Classification System of transposable elements, giving Order and Superfamily specifications (like LTR/Gypsy and LTR/Copia etc., in accordance with the provided gff file "kyuss_v2_repeatmask.gff").

      According to the provided repeat annotation BED file, more than 750K repeat features have been annotated on the Kyuss_2 genome. Of these repeat features, 57815 overlap with gene features and 25843 of these overlaps are longer than 100 bp. This indicates that a substantial portion of the 38765 annotated genes might represent sequences coding for transposon proteins and/or transposon-related ORFs. I suggest that the authors revise the gene annotation data (and at least remove gene annotation entries that show ~100% overlap with repeat features).
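
      As an illustration of the suggested filter, the sketch below computes the fraction of each gene covered by repeat features and flags genes that are (almost) entirely covered. It assumes half-open BED-style coordinates on a single sequence; the 0.99 cut-off, the function names and the toy coordinates are hypothetical and are not taken from the manuscript.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED

def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def covered_fraction(gene: Interval, repeats: List[Interval]) -> float:
    """Fraction of the gene's length that lies inside any repeat feature."""
    g_start, g_end = gene
    covered = 0
    for r_start, r_end in merge(repeats):
        covered += max(0, min(g_end, r_end) - max(g_start, r_start))
    return covered / (g_end - g_start)

# Toy data on one sequence (hypothetical coordinates).
genes: Dict[str, Interval] = {"geneA": (100, 600), "geneB": (1000, 2000)}
repeats: List[Interval] = [(50, 400), (350, 700), (1500, 1550)]

flagged = [name for name, iv in genes.items() if covered_fraction(iv, repeats) >= 0.99]
print(flagged)
```

      On real data the same calculation would be run per chromosome, or delegated to an interval tool, before deciding which gene models to remove.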

      Assembly quality assessment: "The quality score(QV) estimated by Polca for Kyuss v2.0 was 50, suggesting a 99.999% base-level accuracy with the probability of one sequencing error per 100 kb. The estimated accuracy of Kyuss v1.0 is 99.990% (QV40, Table 1), which is 10 times lower than Kyuss v2.0, suggesting that Kyuss v2 is more accurate than Kyuss v1.0." In my opinion, this sentence needs clarification, as readers might have difficulty interpreting it properly, especially considering that the same long-read data were used for both the v1 and the v2 assembly versions, the short-read mapping rate was the same (99.55%) for both versions, and the k-mer completeness analysis results differed only slightly (99.39% vs. 99.48%).
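
      The arithmetic behind the quoted sentence may help readers here. A QV is a Phred-scaled value, so the per-base error probability is 10^(-QV/10). The short sketch below (function names are illustrative) converts the two reported QVs into error rates and accuracies.

```python
import math

def qv_to_error_rate(qv: float) -> float:
    """Phred-scaled quality value -> per-base error probability."""
    return 10 ** (-qv / 10)

def error_rate_to_qv(p: float) -> float:
    """Per-base error probability -> Phred-scaled quality value."""
    return -10 * math.log10(p)

for label, qv in [("Kyuss v1.0", 40), ("Kyuss v2.0", 50)]:
    p = qv_to_error_rate(qv)
    print(f"{label}: QV{qv}, accuracy {100 * (1 - p):.3f}%, "
          f"about one error per {round(1 / p):,} bp")
```

      Read this way, the tenfold difference is in the error rate (one error per 10 kb at QV40 versus one per 100 kb at QV50), not in the accuracy percentages, which differ only in the third decimal place.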

    1. We believe citizen science has the potential to promote human and nature connection in urban areas and provide useful data on urban biodiversity.
    1. Editors Assessment:

      This is a Data Release paper describing data sets derived from the Pomar Urbano project, which catalogs edible fruit-bearing plants in Brazil. It includes data sourced from the citizen science iNaturalist app, tracking the distribution and monitoring of these plants within urban landscapes (Brazilian state capitals). The data were audited, peer reviewed and put into better context, and a companion commentary in GigaScience journal explains the rationale for the study in more detail. The data provide a platform for understanding the diversity of fruit-bearing plants in select Brazilian cities and contribute to many open research questions in the existing literature on urban foraging and ecosystem services in urban environments.

      This evaluation refers to version 1 of the preprint

    2. Abstract: This paper presents two key data sets derived from the Pomar Urbano project. The first data set is a comprehensive catalog of edible fruit-bearing plant species, native or introduced in Brazil. The second data set, sourced from the iNaturalist platform, tracks the distribution and monitoring of these plants within urban landscapes across Brazil. The study encompasses data from all 27 Brazilian state capitals, focusing on the ten cities that contributed the most observations as of August 2023. The research emphasizes the significance of citizen science in urban biodiversity monitoring and its potential to contribute to various fields, including food and nutrition, creative industry, study of plant phenology, and machine learning applications. We expect the data sets to serve as a resource for further studies in urban foraging, food security, cultural ecosystem services, and environmental sustainability.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.108 and see also the accompanying commentary in GigaScience: https://doi.org/10.1093/gigascience/giae007 ), and has published the reviews under the same license as follows:

      Reviewer 1. Corey T. Callaghan

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      Yes. More information should be given on the relevance to GBIF. And why the dataset is necessary to 'stand alone'. The main reason I guess is because in this context cultivated organisms are really valuable as a lot of your target organisms will indeed be cultivated.

      Is the data acquisition clear, complete and methodologically sound?

      No. More detail should be provided about the difference in research grade and cultivated organisms on iNaturalist. The RG could be downloaded from GBIF, but I understand the need to go around that given that the cultivated organisms are also valuable in this context.

      Is the validation suitable for this type of data?

      No. There should be more information provided on the CV model. And more information provided on the importance of identifiers in iNaturalist ecosystem. They are critically important. Right now, it reads as if the CV model generally accurately identifies organisms, but this isn't necessarily true, and there is no reference given. However, the identifiers are necessary to help data processing and identification of the organisms submitted to iNaturalist. I also think the biases of cultivated organisms not being identified as readily by iNaturalist identifiers should be discussed somewhere in the manuscript.

      Additional Comments:

      I appreciated the description of this dataset and particularly liked the 'context' section and think it did a good job of setting up the need for such data. I would use iNaturalist throughout as opposed to iNat since iNat is a bit more colloquial.

      Reviewer 2. Patrick Hurley

      This is a very interesting paper and approach to examining questions related to the presence of edible plants in Brazilian cities. As such, it addresses--whether intentionally or not--open questions within the existing literatures of urban foraging and urban ecosystem services (Shackleton et al. 2017), among others, including:

      1. how the existing species composition of cities creates already existing edible/useful landscapes (see Hurley et al. 2015, Hurley and Emery 2018, Hurley et al. 2022), or what the authors appear to describe as "orchards", and including the use of open data sources to support these activities (Stark et al. 2019),
      2. the ways that urban forests support cultural ecosystem services (Plieninger et al. 2015), 2a. dietary need/food security (Synk et al. 2017, Bunge et al. 2019, Gaither et al. 2020, Sardeshpande & Shackleton 2023), including in Brazil (Brito et al 2020), and diversity (Garekae & Shackleton 2020), 2b. sharing of ecological knowledge (Landor-Yamagata 2018), and 2c. social-ecological resilience (Sardeshpande et al. 2021) as well as 2d. reconnect urban residents to nature/biodiversity (Palliwoda et al. 2017, Fischer and Kowarik 2020, Schunko and Brandner 2022).

      3. I note that while most of the literatures above focus on foods and edibility, Hurley et al. 2015 and Hurley and Emery 2018 consider the relationship of urban forests to other, non-food-related uses and thus the material connections and uses by people within art and other cultural objects.

      4. I also note that some scholars are beginning to focus on the question of urban governance and the inclusion of urban fruit trees (Kowalski & Conway 2023), building off of the rapidly expanding literature on urban food forestry (Clark and Nicholas 2011) and edible green infrastructure. The difference between these literatures and those I've suggested above is that they generally focus on policy and planting interventions to insert, add, or otherwise enhance the edibility of these spaces (as opposed to the above stream analyzing how people interact with what is already there, whether those species are intended for harvest by people or not), and thus it seems like this piece better links to those issues.

      5. It would be helpful to see at least some of these links between the present research and its focus on methods for using a particularly valuable dataset linked to/with efforts to address the conceptual questions that are raised by the authors. For example, in relation to item #1 above, I might suggest dropping the use of "orchard" and describing the species being analyzed as representative of "actually existing food forests" within these cities (building on the existing literature Items 1 through 3), while indicating the insights it might provide to those interested in interventions to shape future cities and their species composition to enhance human benefits (items 4 and 5). Likewise, it would be helpful to reference the items in 2a through 2d where they appear in the Context section, building on the very high level citations already (e.g., current citations #5 FAO and #6 Salbitano).

      To be clear, much of what I'm asking for here can be, I think, addressed through additions of single sentences or phrases throughout the context section, along with brief reference to these within the brief discussions under "Reuse Potential".

      Or perhaps this is too in-depth for this journal. If that's the case, then I do think that reference to several key articles is needed, specifically to signal the insights this piece has for this ongoing work to understand how urban forests function for human benefit. Those would be:

      Shackleton et al. 2017, Hurley & Emery 2018, Garekae & Shackleton 2020, Fischer & Kowarik 2020, Sardeshpande et al. 2021.

      Most critically, the work of Stark et al. 2019 should be acknowledged.

      My sincere thanks to the authors for the opportunity to learn from this work and my apologies for the delay in completing this review.

      Works Cited Above

      Bunge, A., Diemont, S. A., Bunge, J. A., & Harris, S. (2019). Urban foraging for food security and sovereignty: quantifying edible forest yield in Syracuse, New York using four common fruit-and nut-producing street tree species. Journal of Urban Ecology, 5(1), juy028.

      Fischer, L. K., & Kowarik, I. (2020). Connecting people to biodiversity in cities of tomorrow: Is urban foraging a powerful tool?. Ecological Indicators, 112, 106087.

      Garekae, H., & Shackleton, C. M. (2020). Foraging wild food in urban spaces: the contribution of wild foods to urban dietary diversity in South Africa. Sustainability, 12(2), 678.

      Hurley, P. T., Emery, M. R., McLain, R., Poe, M., Grabbatin, B., & Goetcheus, C. L. (2015). Whose urban forest? The political ecology of foraging urban nontimber forest products. Sustainability in the global city: Myth and practice, 187-212.

      Hurley, P. T., & Emery, M. R. (2018). Locating provisioning ecosystem services in urban forests: Forageable woody species in New York City, USA. Landscape and Urban Planning, 170, 266-275.

      Hurley, P. T., Becker, S., Emery, M. R., & Detweiler, J. (2022). Estimating the alignment of tree species composition with foraging practice in Philadelphia's urban forest: Toward a rapid assessment of provisioning services. Urban Forestry & Urban Greening, 68, 127456.

      Kowalski, J. M., & Conway, T. M. (2023). The routes to fruit: Governance of urban food trees in Canada. Urban Forestry & Urban Greening, 86, 128045.

      Landor-Yamagata, J. L., Kowarik, I., & Fischer, L. K. (2018). Urban foraging in Berlin: People, plants and practices within the metropolitan green infrastructure. Sustainability, 10(6), 1873.

      Palliwoda, J., Kowarik, I., & von der Lippe, M. (2017). Human-biodiversity interactions in urban parks: The species level matters. Landscape and Urban Planning, 157, 394-406.

      Plieninger, T., Bieling, C., Fagerholm, N., Byg, A., Hartel, T., Hurley, P., ... & Huntsinger, L. (2015). The role of cultural ecosystem services in landscape management and planning. Current Opinion in Environmental Sustainability, 14, 28-33.

      Sardeshpande, M., Hurley, P. T., Mollee, E., Garekae, H., Dahlberg, A. C., Emery, M. R., & Shackleton, C. (2021). How people foraging in urban greenspace can mobilize social–ecological resilience during Covid-19 and beyond. Frontiers in Sustainable Cities, 3, 686254.

      Sardeshpande, M., & Shackleton, C. (2023). Fruits of the city: The nature, nurture and future of urban foraging. People and Nature, 5(1), 213-227.

      Schunko, C., & Brandner, A. (2022). Urban nature at the fingertips: Investigating wild food foraging to enable nature interactions of urban dwellers. Ambio, 51(5), 1168-1178.

      Shackleton, C. M., Hurley, P. T., Dahlberg, A. C., Emery, M. R., & Nagendra, H. (2017). Urban foraging: A ubiquitous human practice overlooked by urban planners, policy, and research. Sustainability, 9(10), 1884.

      Stark, P. B., Miller, D., Carlson, T. J., & De Vasquez, K. R. (2019). Open-source food: Nutrition, toxicology, and availability of wild edible greens in the East Bay. PLoS One, 14(1), e0202450.

      Synk, C. M., Kim, B. F., Davis, C. A., Harding, J., Rogers, V., Hurley, P. T., ... & Nachman, K. E. (2017). Gathering Baltimore’s bounty: Characterizing behaviors, motivations, and barriers of foragers in an urban ecosystem. Urban Forestry & Urban Greening, 28, 97-102.

    1. Abstract: Important tasks in biomedical discovery such as predicting gene functions, gene-disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction, then imbalance or bias in the distribution of degrees could lead to nonspecific or misleading predictions. We introduce a network permutation framework to quantify the effects of node degree on edge prediction. Our framework decomposes performance into the proportions attributable to degree and the network’s specific connections. We discover that performance attributable to factors other than degree is often only a small portion of overall performance. Degree’s predictive performance diminishes when the networks used for training and testing—despite measuring the same biological relationships—were generated using distinct techniques and hence have large differences in degree distribution. We introduce the permutation-derived edge prior as the probability that an edge exists based only on degree. The edge prior shows excellent discrimination and calibration for 20 biomedical networks (16 bipartite, 3 undirected, 1 directed), with AUROCs frequently exceeding 0.85. Researchers seeking to predict new or missing edges in biological networks should use the edge prior as a baseline to identify the fraction of performance that is nonspecific because of degree. We released our methods as an open-source Python package (https://github.com/hetio/xswap/).

      This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giae001), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Linlin Zhuo

      In this manuscript, the authors introduce a network permutation framework to quantify the effects of node degree on edge prediction. The importance of degree in the edge detection task is self-evident, and the quantification of this effect is undoubtedly groundbreaking. The experimental results on a variety of datasets demonstrate the advanced nature of the method proposed by the authors. However, some parts require further explanation from the authors and can be considered for acceptance in a later stage.

      1. The imbalance of the degree distribution has a significant impact on the results of the edge detection task. In this manuscript, the author proposes a framework to quantify this impact. It is important to note that the manuscript does not explicitly mention the specific form in which the quantification is reflected, such as whether it is presented as an indicator or in another form. Therefore, further explanation from the author is needed to clarify this aspect.

      2. The authors propose that researchers employ marginal priors as a reference point to discern the contributions attributed to node degree from those arising from specific performance. It would be helpful if the authors could elaborate further on the methodology or provide a sample demonstration to clarify the implementation of this approach.

      3. For the XSwap algorithm, I wonder if the authors could provide a more detailed explanation of its workings, including a step-by-step implementation of the improved XSwap. Furthermore, it would be beneficial if the authors could highlight the significance of the improved XSwap algorithm in biomedical tasks.

      4. The author presents the pseudocode of the XSwap algorithm in Figure 2, along with the improved pseudocode after the author's enhancements. Both pseudocodes are accompanied by explanatory text. However, I believe that expressing them in the form of a figure would make them more visually appealing and intuitive.

      5. The authors introduce the edge prior to quantify the probability of two nodes being connected solely based on their degree. I request the authors to provide a detailed explanation of the specific implementation of the edge prior.

      6. In the "Prediction tasks" section, the author utilizes three prediction tasks to assess the performance of the edge prior. It is recommended to segment the text correctly for better display of the content.

      7. The focus of the article might not be prominent enough. It is advisable for the author to provide further elaboration on the advanced nature of the proposed framework and its significance in practical tasks. This would help emphasize the main contributions of the research and its relevance in real-world applications.
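
      For readers following points 3 and 5, the general idea of a degree-preserving edge swap and a permutation-derived edge prior can be sketched as below. This is a simplified illustration for a bipartite edge list, not the authors' implementation or the API of the xswap package: the function names, the number of permutations and swaps, and the per-node-pair averaging are all assumptions made for the example.

```python
import random
from collections import Counter
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]  # (source, target) in a bipartite network

def xswap_permute(edges: List[Edge], n_swaps: int, seed: int = 0) -> List[Edge]:
    """Shuffle edges by swapping targets of random edge pairs; degrees are preserved."""
    rng = random.Random(seed)
    edges = list(edges)
    present: Set[Edge] = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = (a, d), (c, b)
        # Reject swaps that would duplicate an existing edge
        # (this also rejects pairs sharing a source or a target).
        if new1 in present or new2 in present:
            continue
        present -= {(a, b), (c, d)}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges

def edge_prior(edges: List[Edge], pairs: List[Edge], n_perm: int = 200,
               swaps_per_edge: int = 10, seed: int = 0) -> Dict[Edge, float]:
    """Fraction of permuted networks in which each queried pair is connected."""
    counts: Counter = Counter()
    current = list(edges)
    for k in range(n_perm):
        current = xswap_permute(current, swaps_per_edge * len(current), seed + k)
        present = set(current)
        for pair in pairs:
            if pair in present:
                counts[pair] += 1
    return {pair: counts[pair] / n_perm for pair in pairs}

# Toy network: source 0 is connected to every target, so degree alone fixes its edges.
toy_edges = [(0, 0), (0, 1), (0, 2), (1, 0), (2, 1)]
prior = edge_prior(toy_edges, [(0, 2), (1, 2), (1, 1)])
print(prior)
```

      In the toy network, source 0 has degree 3 and there are only three targets, so any degree-preserving permutation keeps all of its edges; the prior for such a pair is determined entirely by degree, which is the effect the manuscript sets out to quantify.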

      Reviewer 1: Babita Pandey

      The manuscript "The probability of edge existence due to node degree: a baseline for network-based predictions" presents novel work. But some of the sections are written very briefly, so they are difficult to understand. The sections that need revision are: Degree-grouping, The edge prior encapsulates degree, Degree can underly a large fraction of performance and Analytical approximation of the edge prior. The result section needs revision.

      Some other concerns are: Adamic-Adar, Jaccard coefficient, preferential attachment, etc. are link prediction methods. Why has the author termed them edge prediction features?
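
      For reference, the three measures named above can be written down in a few lines using their standard textbook definitions. The sketch below uses a toy undirected graph stored as an adjacency dictionary; the graph and function names are illustrative only. Each function returns one score per node pair, which can be used either to rank candidate edges directly or as an input feature to a downstream classifier.

```python
import math
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # node -> set of neighbours (undirected)

def jaccard(g: Graph, u: str, v: str) -> float:
    """Shared neighbours divided by all neighbours of either node."""
    union = g[u] | g[v]
    return len(g[u] & g[v]) / len(union) if union else 0.0

def adamic_adar(g: Graph, u: str, v: str) -> float:
    """Shared neighbours, each weighted by 1 / log(degree of that neighbour)."""
    return sum(1 / math.log(len(g[w])) for w in g[u] & g[v] if len(g[w]) > 1)

def preferential_attachment(g: Graph, u: str, v: str) -> int:
    """Product of the two node degrees; depends on degree only."""
    return len(g[u]) * len(g[v])

# Toy graph with edges a-b, a-c, b-c, c-d.
toy: Graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
for u, v in [("a", "b"), ("a", "d")]:
    print(u, v, jaccard(toy, u, v), adamic_adar(toy, u, v), preferential_attachment(toy, u, v))
```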

  6. Feb 2024
    1. Editors Assessment:

      One limiting factor in the adoption of spatial omics research is the lack of workflow systems for data preprocessing, and to address this the authors developed the SAW tool to process Stereo-seq data. The analysis steps of spatial transcriptomics involve obtaining gene expression information from space and cells. Existing tools face issues with large data sets, such as intensive spatial localization, RNA alignment, and excessive memory usage. These issues affect the process's applicability and efficiency. To address this, this paper presents a high-performance open-source workflow called SAW for Stereo-Seq. This includes mRNA position reconstruction, genome alignment, matrix generation, clustering, and result file generation for personalized analysis. During review the authors added examples of MID correction to the article to make the process easier to understand. In the future, more accurate algorithms or deep learning models may further improve the accuracy of this pipeline.

      This evaluation refers to version 1 of the preprint

    2. Abstract: The basic analysis steps of spatial transcriptomics involve obtaining gene expression information from both space and cells. This process requires a set of tools to be completed, and existing tools face performance issues when dealing with large data sets. These issues include computationally intensive spatial localization, RNA genome alignment, and excessive memory usage in large chip scenarios. These problems affect the applicability and efficiency of the process. To address these issues, a high-performance and accurate spatial transcriptomics data analysis workflow called Stereo-Seq Analysis Workflow (SAW) has been developed for the Stereo-Seq technology developed by BGI. This workflow includes mRNA spatial position reconstruction, genome alignment, gene expression matrix generation and clustering, and generates results files in a universal format for subsequent personalized analysis. The execution time for the entire analysis process is ∼148 minutes on 1G reads 1*1 cm chip test data, 1.8 times faster than the unoptimized workflow.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.111) as part of our Spatial Omics Methods and Applications series (https://doi.org/10.46471/GIGABYTE_SERIES_0005), and has published the reviews under the same license as follows:

      Reviewer 1. Zexuan Zhu

      It would be helpful if some examples can be provided to illustrate the key steps, e.g., the gene region annotation process and MID correction. Some information of the references is missing. Please carefully check the format of the references.

      Decision: Minor Revision

      Reviewer 2. Yanjie Wei

      In this manuscript, the authors introduce a comprehensive Stereo-seq spatial transcriptomics analysis workflow, termed SAW. This workflow encompasses mRNA spatial position reconstruction, genome alignment, gene expression matrix generation, and clustering, culminating in the production of universally formatted results files for subsequent personalized analysis. SAW is particularly optimized for large field Stereo-seq spatial transcriptomics.
      

      The authors provide an in-depth elucidation of SAW's workflow and the optimization techniques employed for each module. However, several aspects warrant further discussion:

      1. The authors outline a strategy to reduce memory consumption during the mapping of CID tagged reads to corresponding coordinates by partitioning the mask file and fastq files. The manuscript, however, lacks a detailed description of how these files are divided. It would be beneficial if the authors could furnish additional information regarding this partitioning method.

      2. The gene expression matrix, a crucial output of the SAW process, lacks sufficient evaluation to substantiate its accuracy. The count tool generates this matrix through three primary steps: gene region annotation, MID correction, and MID deduplication. During the gene annotation phase, a hard threshold (50% of the read overlapping with exon) is used to determine if a read is exonic. The basis for this threshold, however, remains unclear.

      3. In the testing section, the authors evaluated the workflow on 2 S1 chips with approximately 1 million reads. The optimized workflow demonstrated a 1.8-fold speed increase compared to the non-optimized version. Table 2 only presents the total runtime before and after optimization. It would be advantageous if the authors could enrich this table by including the runtime of critical modules, such as read mapping, which accounts for 70% of the total runtime.
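      To make the threshold question in comment 2 concrete, the following is a minimal sketch of a 50% exon-overlap rule. It is not SAW's implementation: the function names, the half-open coordinate convention, the "non-exonic" label, and the choice of ">=" rather than ">" at exactly 50% are all assumptions made for illustration.

```python
def exon_overlap_fraction(read_start, read_end, exons):
    """Return the fraction of a read's bases that fall inside exons.

    Coordinates are half-open [start, end). `exons` is a list of
    non-overlapping (start, end) tuples on the same contig and strand.
    """
    read_len = read_end - read_start
    if read_len <= 0:
        raise ValueError("read must have positive length")
    covered = 0
    for ex_start, ex_end in exons:
        # Length of the intersection of the read and this exon (0 if disjoint).
        covered += max(0, min(read_end, ex_end) - max(read_start, ex_start))
    return covered / read_len


def annotate_read(read_start, read_end, exons, threshold=0.5):
    """Label a read 'exonic' when at least `threshold` of it overlaps exons.

    The 0.5 default mirrors the hard threshold discussed in the review;
    treating exactly 50% as exonic is an assumption of this sketch.
    """
    fraction = exon_overlap_fraction(read_start, read_end, exons)
    return "exonic" if fraction >= threshold else "non-exonic"


if __name__ == "__main__":
    toy_exons = [(150, 300)]  # hypothetical exon, not real annotation
    print(annotate_read(100, 200, toy_exons))  # 50 of 100 bases overlap
    print(annotate_read(100, 190, toy_exons))  # 40 of 90 bases overlap
```

      Exposing the threshold as a parameter, as here, would let a sensitivity analysis show how the expression matrix changes as the cut-off moves away from 50%.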

    1. ABSTRACTStereo-seq is a cutting-edge technique for spatially resolved transcriptomics that combines subcellular resolution with centimeter-level field-of-view, serving as a technical foundation for analyzing large tissues at the single-cell level. Our previous work presents the first one-stop software that utilizes cell nuclei staining images and statistical methods to generate high-confidence single-cell spatial gene expression profiles for Stereo-seq data. With recent advancements in Stereo-seq technology, it is possible to acquire cell boundary information, such as cell membrane/wall staining images. To take advantage of this progress, we update our software to a new version, named STCellbin, which utilizes the cell nuclei staining images as a bridge to align cell membrane/wall staining images with spatial gene expression maps. By employing an advanced cell segmentation technique, accurate cell boundaries can be obtained, leading to more reliable single-cell spatial gene expression profiles. Experimental results verify that STCellbin can be applied on the mouse liver (cell membranes) and Arabidopsis seed (cell walls) datasets and outperforms other competitive methods. The improved capability of capturing single cell gene expression profiles by this update results in a deeper understanding of the contribution of single cell phenotypes to tissue biology.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.110) as part of our Spatial Omics Methods and Applications series (https://doi.org/10.46471/GIGABYTE_SERIES_0005), and has published the reviews under the same license as follows:

      Reviewer 1. Chunquan Li

      Stereo-seq, an advanced spatial transcriptomics technique, allows detailed analysis of large tissues at the single-cell level with precise subcellular resolution. The authors' prior software was groundbreaking, generating robust single-cell spatial gene expression profiles using cell nuclei staining images and statistical methods. They have enhanced their software to STCellbin, using cell nuclei images to align cell membrane/wall staining images. This update employs improved cell segmentation, ensuring accurate boundaries and more dependable single-cell spatial gene expression profiles. Successful tests on mouse liver and Arabidopsis seed datasets demonstrate STCellbin's effectiveness, enabling a deeper insight into the role of single-cell characteristics in tissue biology. However, I do have some suggestions and questions about certain parts of the manuscript.

      1. The authors should show the advantages and performance of STCellbin compared to other methods, such as its computational efficiency, accuracy, and suitability for various image types.
      2. To comprehensively assess the performance of STCellbin, the authors should consider integrating other commonly used cell segmentation evaluation metrics, such as F1-score, Dice coefficient, and so forth.
      3. To ensure the completeness and reproducibility of the data analysis, more detailed information regarding the clustering analysis of the single-cell spatial gene expression maps generated through STCellbin is requested. This information should encompass methods, parameters, and results such as cluster type annotations.
      4. The authors can use simpler and clearer language and terminology to describe the image registration process in the methods section, ensuring that readers can easily understand the workflow and principles of image registration.
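      As an illustration of the metrics named in comment 2, here is a minimal pixel-level sketch of the Dice coefficient and F1-score for a predicted versus a reference foreground mask. The function names and toy pixel sets are hypothetical. Cell segmentation benchmarks often compute F1 at the object level after matching cells by an overlap criterion, which this sketch does not do.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 * |A & B| / (|A| + |B|) for two sets of foreground pixels."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def precision_recall_f1(pred, truth):
    """Pixel-level precision, recall and F1 for two foreground pixel sets."""
    pred, truth = set(pred), set(truth)
    true_pos = len(pred & truth)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy (row, col) pixel coordinates, not taken from any dataset.
    predicted = {(0, 0), (0, 1), (1, 1)}
    reference = {(0, 1), (1, 1), (2, 2), (2, 3)}
    print(dice_coefficient(predicted, reference))
    print(precision_recall_f1(predicted, reference))
```

      At the pixel level the two metrics coincide, so reporting both is informative mainly when F1 is computed per cell rather than per pixel.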

      Reviewer 2. Zhaowei Wang

      In this manuscript, the authors propose STCellbin to generate single-cell gene expression profiles for high-resolution spatial transcriptomics based on cell boundary images. The experiment results on mouse liver and Arabidopsis seed datasets prove the good performance of STCellbin. The topic is significant and the proposed method is feasible. However, there are still some concerns and problems to be improved and clarified.
      

      (1) STCellbin is an updated version of StereoCell, but the explanation of StereoCell is not sufficient. The authors should give a more detailed explanation of StereoCell, such as its input and main process.
      (2) The authors list some existing dyeing methods in Lines 52-53, Page 3. They should clarify that these methods are used for nuclei staining, which differentiates them from the cell membrane/wall staining methods discussed in the content that follows. This would provide a more accurate explanation for readers and users.
      (3) The authors share the GitHub repository of STCellbin, and I noticed that when executing STCellbin, the input only requires the path of the image data and spatial gene expression data, the path of the output results, and the chip number. Are there other adjustable parameters?
      (4) On Page 5, Line 85, “steps” should be “step”, and on Page 8, Line 145, “must” would be better revised to “should”. Moreover, the usage of “stained image” and “staining image” should be consistent.

    2. Editors Assessment:

      This paper describes a new spatial transcriptomics method that utilizes cell nuclei staining images and statistical methods to generate high-confidence single-cell spatial gene expression profiles for Stereo-seq data. STCellbin is an update of StereoCell that now uses a more advanced cell segmentation technique, so more accurate cell boundaries and therefore more reliable single-cell spatial gene expression profiles can be obtained. After peer review, more comparisons were added and more description was given of what was upgraded in this version to convince the reviewers. These additions demonstrate that it is a more reliable method, particularly for analyzing high-resolution and large-field-of-view spatial transcriptomic data, and that it extends the capability to automatically process Stereo-seq cell membrane/wall staining images for identifying cell boundaries.

      This evaluation refers to version 2 of the preprint

    1. Editors Assessment:

      For better data quality assessment of large spatial transcriptomics datasets, the new BatchEval software has been developed as a batch effect evaluation tool. It generates a comprehensive report with assessment findings, including basic information on the integrated datasets, a batch effect score, and recommended methods for removing batch effects. The report also includes evaluation details for the raw dataset and results from batch effect removal methods. After peer review and clarification of a number of points, it now looks convincing that this tool helps researchers identify and remove batch effects, ensuring reliable and meaningful insights from integrated datasets. This potentially makes the tool valuable for researchers who need to analyze large datasets of this type, as it provides an easy and reliable way to assess data quality and ensures that downstream analyses are robust and reliable.

      This evaluation refers to version 1 of the preprint

    2. ABSTRACTAs genomic sequencing technology continues to advance, it becomes increasingly important to perform joint analyses of multiple datasets of transcriptomics. However, batch effect presents challenges for dataset integration, such as sequencing data measured on different platforms, and datasets collected at different times. Here, we report the development of BatchEval Pipeline, a batch effect workflow used to evaluate batch effect on dataset integration. The BatchEval Pipeline generates a comprehensive report, which consists of a series of HTML pages for assessment findings, including a main page, a raw dataset evaluation page, and several built-in methods evaluation pages. The main page exhibits basic information of the integrated datasets, a comprehensive score of batch effect, and the most recommended method for removing batch effect from the current datasets. The remaining pages exhibit evaluation details for the raw dataset, and evaluation results from the built-in batch effect removal methods after removing batch effect. This comprehensive report enables researchers to accurately identify and remove batch effects, resulting in more reliable and meaningful biological insights from integrated datasets. In summary, the BatchEval Pipeline represents a significant advancement in batch effect evaluation, and is a valuable tool to improve the accuracy and reliability of the experimental results.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.108) as part of our Spatial Omics Methods and Applications series (https://doi.org/10.46471/GIGABYTE_SERIES_0005), and has published the reviews under the same license as follows:

      Reviewer 1. Chunquan Li

      1. Page 1, Lines 14-16. The authors indicate that “it is crucial to thoroughly investigate the batch effects in the dataset before integrating and processing the data”. The term “thoroughly” may not be accurate enough. The current method can alleviate the batch effects, but it cannot thoroughly solve the related problems. In addition, this work proposes a batch evaluation tool, so “reasonably evaluate the batch effects” may be more accurate than “thoroughly investigate the batch effects”.
      2. In Figure 1, is the first box “integrated datasets”?
      3. Page 5, Line 168, and Page 6, Lines 169-175, the content of these two paragraphs is similar, with some redundant descriptions. It is recommended to organize and write them into one paragraph.
      4. There is Table 1 in the table list, but Table 1 is missing in the main text.
      5. Page 8, Discussion section, it is better to discuss the differences between the proposed tool and a similar tool “batchQC”, especially the advantages of the proposed tool.
      6. Some other minor issues: Page 1, Line 22, “to do so” should be “to do it”. Page 3, Line 100, Ref. [13] should be cited when it first appears on Line 97. Page 4, Line 114 and Page 5, Line 146, “UMAP” should be given its full name when it first appears and abbreviated directly in the following text. The variable should be in italics, such as “p” on Page 4, Line 119, “H” on Page 6, Line 184.

      Reviewer 2. W. Evan Johnson and Howard Fan

      Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code?

      Yes. However, the code could use substantial improvements.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

      No. The manuscript is missing a section describing the software and its implementation.

      Is there enough clear information in the documentation to install, run and test this tool, including information on where to seek help if required?

      Yes. But it took a while to get it installed.

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages?

      No. I think the most glaring deficiency in the paper is the lack of comparison with other methods. For example, there is no comparison of the tools available in BatchEval compared to other methods, such as BatchQC. Also, they mention that BatchQC might not work on larger datasets, but they perform no performance evaluation for BatchEval, and no comparison with BatchQC to demonstrate improved performance.

      Are there (ideally real world) examples demonstrating use of the software?

      Yes. Missed opportunity--I think the most exciting thing I observed from the paper was that the example data were from spatial transcriptomics data! To my knowledge, existing batch effect methods are not directly adapted to manage these data (although they did mention tools like BatchQC cannot handle large datasets, which may be true). But they don’t mention anything about batch adjustment/evaluation in spatial data in the manuscript. I feel that if the authors address this niche it would increase the value/impact of their work!

      Additional Comments:

      This review was conducted and written by Evan Johnson, who developed the competing BatchQC software.

      The authors provide an interesting toolkit for assessing batch effects in genomics data. The paper was clear and well-written, albeit I had a few concerns (see below). We were also able to download the associated software and test it out (comments below as well).

      I think the most exciting thing I observed from the paper was that the example data were from spatial transcriptomics data! To my knowledge, existing batch effect methods are not directly adapted to manage these data (although they did mention tools like BatchQC cannot handle large datasets, which may be true). But they don’t mention anything about batch adjustment/evaluation in spatial data in the manuscript. I feel that if the authors address this niche it would increase the value/impact of their work!

      In addition, this toolkit is written in Python, while BatchQC and other tools are written in R, so this is an advantage of the method as well: it addresses an audience that uses Python for gene expression analysis (not as big as the R community, but substantial). Their Python toolkit might also be easier to integrate into a pipeline workflow (for a core or large project) than R-based tools like BatchQC, which might be important to mention as well.

      I think the most glaring deficiency in the paper is the lack of comparison with other methods. For example, there is no comparison of the tools available in BatchEval compared to other methods, such as BatchQC. Also, they mention that BatchQC might not work on larger datasets, but they perform no performance evaluation for BatchEval, and no comparison with BatchQC to demonstrate improved performance.

      Similarly, the authors claim: “Manimaran [10] has developed user-friendly software for evaluating batch effects. However, the software does not take into account nonlinear batch effects and may not be able to provide objective conclusions.” I don’t understand what the authors mean by “may not be able to provide objective conclusions”. BatchQC provides several visual and numerical evaluations of batch effect, more so than even the proposed BatchEval does. Did the authors mean something else, maybe that the lack of non-linear correction may lead to less accurate conclusions?

      A related concern: does BatchEval provide non-linear adjustments? I may have missed this, but it seems that BatchEval is not providing non-linear adjustments either. Also, regarding non-linear adjustments, the authors should show in an example the problems with a lack of non-linear adjustments and show that pre-transforming the data before using BatchQC does not perform as well as the non-linear BatchEval adjustments.

      In Equation 10, should “batchScore” be BatchEvalScore?

      Also, at the bottom of the figure on page 15, should the “BatchQCScore” also be BatchEvalScore?

      The manuscript is missing a section describing the software and its implementation.

      I asked my research scientist, who recently graduated with his PhD in Bioinformatics, to assess the software and examples. First of all, much of the software is named “BatchQC”. I think this is confusing, since the method is really named BatchEval and it will be confused with BatchQC, which is another existing/competing software. Furthermore, it took him significant effort to install the BatchEval software and get it working on our cluster. I would recommend the authors make their software more accessible and easier to install.

      The output of the software was a nice .html report diagnosing the batch effects in the data, which is very useful (attached is a combined .pdf of the .html files that we generated). We were also able to generate a report for the Harmony-adjusted example using their code. One major disadvantage was that these reports are separate files, and this could get very complicated when comparing cases using multiple batch effect methods that will all be in separate reports (refer to a recent single cell batch comparison that compared more than a dozen methods, Tran et al. Genome Biology, 2020; it would be hard to use BatchEval for this comparison).

      Also, it seems that the user is required to conduct the batch correction themselves, BatchEval does not help with the correction except for their example code for Harmony.

      Finally, on comparing the raw and Harmony-adjusted datasets, inspection of the visual assessments (e.g. PCA) shows some improvement, although not a perfect correction. But most of the numerical assessments are still the same. The BatchEvalScore in both cases leads to the conclusion “Need to do batch effect removal”. What’s missing is the difference or improvement that Harmony makes with its correction. Maybe this is just because Harmony doesn’t fully remove the batch effects? Or is there something not working in the code? It might be good to see another example where the batch effect correction improves the BatchEvalScore significantly.

      Additional Files: https://gigabyte-review.rivervalleytechnologies.com/journal/gx/download-files?YXJ0aWNsZT00NDImZmlsZT0xNzEmdHlwZT1nZW5lcmljJnZpZXc9dHJ1ZQ~~

      Re-review:

      I find this paper to be much improved in this version. The authors have clearly worked hard to address my concerns and have addressed them in a satisfactory manner. I fully support the publication of this paper, and I believe their tools are a nice addition to the field.

    1. more sophisticated models [4–11].

      Reference 6 has been retracted due to potential manipulation of the publication process. The publisher of this paper cannot vouch for its reliability, but in this case the citation does not change the conclusions of the work published here. We highlight this to let readers know.

    2. Qureshi MB, Azad L, Qureshi MS, et al.  Brain decoding using fMRI images for multiple subjects through deep learning. Comput Math Methods Med. 2022;2022:1–10.

      Reference 6 has been retracted due to potential manipulation of the publication process. The publisher of this paper cannot vouch for its reliability, but in this case the citation does not change the conclusions of the work published here. We highlight this to let readers know.


    1. AbstractIntegrative analysis of spatially resolved transcriptomics datasets empowers a deeper understanding of complex biological systems. However, integrating multiple tissue sections presents challenges for batch effect removal, particularly when the sections are measured by various technologies or collected at different times. Here, we propose spatiAlign, an unsupervised contrastive learning model that employs the expression of all measured genes and the spatial location of cells, to integrate multiple tissue sections. It enables the joint downstream analysis of multiple datasets not only in low-dimensional embeddings but also in the reconstructed full expression space. In benchmarking analysis, spatiAlign outperforms state-of-the-art methods in learning joint and discriminative representations for tissue sections, each potentially characterized by complex batch effects or distinct biological characteristics. Furthermore, we demonstrate the benefits of spatiAlign for the integrative analysis of time-series brain sections, including spatial clustering, differential expression analysis, and particularly trajectory inference that requires a corrected gene expression matrix.

      Reviewer 2. Stefano Monti

      The manuscript addresses the very challenging problem of integrating multiple spatially resolved transcriptomics datasets, and proposes a novel algorithm based on multiple deep learning techniques, including DNN encoders, and self supervised and contrastive learning. Evaluation on several datasets is presented alongside comparison to multiple existing methods using several integration metrics (LISI, ARI, etc.). The presented method appears to outperform existing methods according to multiple criteria, and thus it represents a significant contribution to the field worth publishing.

      The write-up is adequate, although the description of the method is very "abstract", and it would benefit from more specificity in describing the inputs and outputs of each step, how some of the models are shared (e.g., is the DNN encoder shared only across sections/samples or also across the original (Fig 1C, top) and perturbed (Fig 1C, bottom) inputs? Likewise for the Graph Encoder), and the intuition behind each of the steps included.

      Some specific comments:

      * It would be helpful if the results sections describing each of the applications (DLPFC datasets, Olfactory bulb datasets, etc.) were more detailed in the description of the datasets to be combined. What are the inputs (how many samples, are sections the same as samples?, how many slices per sample, etc.)?
      * Unless I'm mistaken, the labeling of Fig S1 is wrong. I think Fig S1a is the UMAP and S1b is the "manual annotation" rather than the other way around?

    2. AbstractIntegrative analysis of spatially resolved transcriptomics datasets empowers a deeper understanding of complex biological systems. However, integrating multiple tissue sections presents challenges for batch effect removal, particularly when the sections are measured by various technologies or collected at different times. Here, we propose spatiAlign, an unsupervised contrastive learning model that employs the expression of all measured genes and the spatial location of cells, to integrate multiple tissue sections. It enables the joint downstream analysis of multiple datasets not only in low-dimensional embeddings but also in the reconstructed full expression space. In benchmarking analysis, spatiAlign outperforms state-of-the-art methods in learning joint and discriminative representations for tissue sections, each potentially characterized by complex batch effects or distinct biological characteristics. Furthermore, we demonstrate the benefits of spatiAlign for the integrative analysis of time-series brain sections, including spatial clustering, differential expression analysis, and particularly trajectory inference that requires a corrected gene expression matrix.

      Reviewer 1. Lamda Moses.

      This paper presents spatiAlign, a package that batch corrects spatial transcriptomics data and performs spatially informed clustering. Spatial information is incorporated in the graph layers of the variational graph autoencoder, which performs dimension reduction, and in the reduced dimensional space, self-supervised contrastive learning is used to batch correct and to assign cells/spots to clusters. The autoencoder then reconstructs a batch-corrected gene count matrix for downstream use with methods that require a full gene count matrix. The method seems reasonable for this task and is well described, more intuitively in the Results section and in more detail in the Methods section.

      Then spatiAlign is benchmarked against several popular and state of the art methods for batch correction, including two recently published methods that use spatial information (GraphST and PRECAST) and several not using spatial information but commonly used (e.g. Seurat, Harmony, COMBAT). The choice of existing methods to benchmark is fair. The LISI F1 score is a reasonable metric to quantify performance in both batch correction and cluster separation when the spatial clusters in the brain datasets used in benchmarking are already annotated. The iLISI (batch correction) and cLISI (cluster separation), analogous to precision and recall in the original F1, are shown separately in the supplement. The F1 score is around 0.8 for spatiAlign, which is pretty good. When there is no a priori annotation, the iLISI is used to quantify how well different batches mix and Moran's I is used to indicate spatial coherence of the clusters, which are then validated with differential expression. spatiAlign is also demonstrated to integrate data from different technologies—Stereo-seq and Visium—which have different spatial resolutions. Finally, spatiAlign is demonstrated on the developing mouse brain integrating data across multiple time points.

      The language of this paper is good and does not require extensive editing for clarity. The spatiAlign package can be installed with pip and has a minimal tutorial on the documentation website.

      Overall, I find this paper well-written and a valuable contribution to this field. There are many methods that perform batch correction without using spatial information, and several that align different tissue sections, some using transcriptome information, but without correcting for batch effect in the transcriptomes. Not all methods that take spatial information into account give a batch corrected full gene count matrix as an output. The metrics reasonably demonstrate superior performance of spatiAlign compared to other methods benchmarked on the datasets used.

      Below are my questions and comments that may improve this paper:

      1. All the benchmarking datasets are from the brain, though different parts of the brain, from human and mouse, with different morphologies. The brain has a stereotypical structure. As spatiAlign uses the spatial neighborhood graph rather than the original coordinates, can it be applied to tissues without such stereotypical structure, such as tumors, skeletal muscle, colon, liver, lung, and adipose tissue? Benchmarking on a dataset from a tissue without a stereotypical structure would make a stronger case, to be more representative of the full breadth of spatial transcriptomics datasets.
      2. Biological variability is mentioned, such as from different regions of hippocampus and different stages of development. Many studies have a disease or experiment group and a control group, often with multiple subjects in each group. There are biological differences among the subjects and technical batch effects between sections, but the differences between case and control are of interest, so we have different kinds of batches. Benchmarking on a case/control study would be really helpful. How well does spatiAlign preserve biological differences between case and control while correcting for technical batch effects?
      3. The Methods section says, "Inspired by unsupervised contrastive clustering[32], we map each spot/cell i into an embedding space with d dimensions, where d is equal to the number of pseudoprototypical clusters." In Tutorial 2 on the documentation website, the latent dimension is set to be 100. Why is this number chosen? Can you clarify how to choose the number of latent dimensions? How does this affect downstream results?
      4. Since you use the k nearest neighbor graph when constructing the spatial neighborhood graph that feeds into the variational graph autoencoder, what are the reasons why k=15 is chosen? Should it be different for array-based technologies such as Visium and Stereo-seq and imaging-based technologies with single cell resolution such as MERFISH? Furthermore, due to different spatial resolutions, the spatial neighborhood graph has different biological meanings for Visium and MERFISH.
      5. All the benchmarking datasets are from array-based technologies: Visium, Slide-seq, and Stereo-seq. Imaging-based technologies are getting commercialized and getting more widely adopted, especially MERFISH and Molecular Cartography. It would be great if you benchmark using an imaging-based dataset and perhaps integrate an imaging-based and an array-based dataset, to be more representative of the full breadth of spatial transcriptomics technologies. This should also take into consideration that imaging-based datasets typically only profile a few hundred genes while array-based datasets are transcriptome-wide. This might be too much for this paper, but should at least be mentioned in the Discussions section.
      6. Is the code used to reproduce the figures available?
      7. Generally, the y axes of bar charts for F1 scores, ARI, normalized iLISI, and normalized cLISI are really confusing when they don't start at 0 and end at 1. This exaggerates how much better spatiAlign performs compared to other methods when the other methods aren't that much worse based on the numbers, such as in Figure 2c.
      8. In Supplementary Figure S4b, do you actually mean 1 - cLISI? If a smaller cLISI is better, then spatiAlign performs the worst in this case, and should have a low F1 score in Figure 2c.
      9. It would be helpful to include a computational time and memory usage benchmark.
      10. The join count statistic is a spatial autocorrelation statistic designed for binary data, and may thus be more appropriate than Moran's I to indicate spatial coherence of clusters, although Moran's I does convey the message of spatial coherence here.
      11. The documentation website can be improved by making a description of all parameters of the functions available, to explain what each parameter means and what kind of input and output is expected.
      12. It would be helpful to include preprocessing in the tutorial on the documentation website. Do we need to log normalize the data first and why? Does the data need to be scaled?

      Below are minor technical comments:

      1. The notation for the LISI F1 score in the Methods section is very confusing. Based on context and the definition of the F1 score, you probably meant to put parentheses around 1 - cLISInorm.
      2. Typo in "SCAlEX" in Supplementary Figure S5a; you seem to mean "SCALEX". It's more aesthetically pleasing to be consistent in capitalizing according to the original names of the packages in Supplementary Figure S5.
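      The parenthesization point in the first minor comment can be illustrated with a short sketch. It assumes the score is the harmonic mean of the normalized iLISI (batch mixing, higher is better) and of one minus the normalized cLISI (cluster separation, where a lower cLISI is better), both scaled to [0, 1]. The function name is hypothetical and the manuscript's exact normalization is not reproduced here.

```python
def lisi_f1(ilisi_norm, clisi_norm):
    """Harmonic mean of batch mixing and cluster separation.

    F1 = 2 * (1 - cLISI_norm) * iLISI_norm / ((1 - cLISI_norm) + iLISI_norm)

    The parentheses around (1 - cLISI_norm) matter: without them the
    expression would be read as 2 * 1 - cLISI_norm * iLISI_norm / ...
    """
    for name, value in (("ilisi_norm", ilisi_norm), ("clisi_norm", clisi_norm)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    mixing = ilisi_norm
    separation = 1.0 - clisi_norm
    if mixing + separation == 0:
        return 0.0  # both components are zero, so the harmonic mean is zero
    return 2 * separation * mixing / (separation + mixing)


if __name__ == "__main__":
    # Illustrative values only, not results from the manuscript.
    print(lisi_f1(ilisi_norm=0.8, clisi_norm=0.2))
```

      Under this reading a smaller cLISI raises the score, which is the behaviour comment 8 on Supplementary Figure S4b asks the authors to confirm.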

      Re-review

      For the most part, the authors have satisfactorily addressed concerns raised by the reviewers. Below are my followup comments on the revised manuscript:

      1. The authors missed the point of my second comment on case/control studies. What I was asking for is performance of spatiAlign and other related packages when integrating case datasets and control datasets while preserving biological differences of interest to the study. For example, data from healthy liver (control) and hepatic steatosis (case) are integrated. Case and control samples were collected from different patients and may be mounted on different slides. How well does spatiAlign preserve differences between healthy and steatosis, while correcting for technical batch effect? In Figure S7, the two sub-slices are still from the same disease condition. Case/control studies should at least be mentioned in the Discussions section.
      2. The authors have provided thoughtful explanations on data scaling, number of latent dimensions, and number of neighbors in the k nearest neighbor graph in the response to reviewers. However, these explanations are not found in the manuscript or on the documentation website. Because these explanations are very relevant to users, it would be helpful to add them to either the manuscript or the documentation website.
      3. For the bar charts, I suggest assigning a fixed color to each data integration method and keeping it consistent throughout this study. Right now the bar charts don't have a consistent color scheme even within the same figure. Keeping a consistent color scheme can reduce the mental burden on readers since the colors are a stand-in for the different methods. Also, a colorblind-friendly palette should be used.
      4. I agree with Reviewer 3 that the grammar in this paper should be improved. For example, in lines 75-76, "in which gene expression is adjustment" should be "in which gene expression is adjusted". In lines 82-83, the "adjusted" in "laminar organization with adjusted, and clear boundaries between regions" does not make sense given the context referring to Figure 2f. In line 332, "the benchmarking methods" should be "the benchmarked methods", because the methods are being benchmarked and the methods themselves are not meant for benchmarking. Grammar in the newly added section from line 344 onwards should be corrected.

    1. Editors Assessment:

      The snake pipefish, Entelurus aequoreus, is a species of fish that dwells in open seagrass habitats in the northern Atlantic. As a pipefish, it is a member of the Syngnathidae family of fish, which also includes seahorses and seadragons. In recent years it has expanded its population size and range into arctic waters. Genomic data are useful for understanding these demographic changes, and to provide them a high-quality reference genome has been produced. Building on a previous short-read reference, a near chromosome-scale genome assembly for the snake pipefish was assembled using PacBio CLR and Hi-C reads. After revisions, the authors provided more details on the assembly metrics: the final assembly has a length of 1.6 Gbp, with scaffold and contig N50s of 62.3 Mbp and 45.0 Mbp, respectively. Demographic inference analysis of the snake pipefish genome using these data enables tracing of population changes over the past 1 million years, and this reference will allow further analyses and studies relating these to changes in climate.

      *This evaluation refers to version 1 of the preprint

    2. Abstract The snake pipefish, Entelurus aequoreus (Linnaeus, 1758), is a slender, up to 60 cm long, northern Atlantic fish that dwells in open seagrass habitats and has recently expanded its distribution range. The snake pipefish is part of the family Syngnathidae (seahorses and pipefish) that has undergone several characteristic morphological changes, such as loss of pelvic fins and elongated snout. Here, we present a highly contiguous, near chromosome-scale genome of the snake pipefish assembled as part of a university master’s course. The final assembly has a length of 1.6 Gbp in 7,391 scaffolds, a scaffold and contig N50 of 62.3 Mbp and 45.0 Mbp and L50 of 12 and 14, respectively. The largest 28 scaffolds (>21 Mbp) span 89.7% of the assembly length. A BUSCO completeness score of 94.1% and a mapping rate above 98% suggest a high assembly completeness. Repetitive elements cover 74.93% of the genome, one of the highest proportions so far identified in vertebrate genomes. Demographic modeling using the PSMC framework indicates a peak in effective population size (50 – 100 kya) during the last interglacial period and suggests that the species might largely benefit from warmer water conditions, as seen today. Our updated snake pipefish assembly forms an important foundation for further analysis of the morphological and molecular changes unique to the family Syngnathidae.
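
      For readers unfamiliar with the N50 and L50 statistics quoted in the abstract, the following is a minimal sketch of how they are conventionally computed from a list of sequence lengths. The function name and the toy lengths are illustrative assumptions and are not taken from the authors' pipeline.

      ```python
      def n50_l50(lengths):
          """Return (N50, L50) for a collection of contig or scaffold lengths.

          N50 is the length of the shortest sequence in the smallest set of
          longest sequences that together cover at least half of the assembly.
          L50 is the number of sequences in that set.
          """
          if not lengths:
              raise ValueError("at least one sequence length is required")
          ordered = sorted(lengths, reverse=True)
          half_total = sum(ordered) / 2
          running = 0
          for count, length in enumerate(ordered, start=1):
              running += length
              if running >= half_total:
                  return length, count


      # Toy example (hypothetical lengths, not the pipefish assembly).
      toy_lengths = [80, 70, 50, 40, 30, 20]
      n50, l50 = n50_l50(toy_lengths)
      print(f"N50 = {n50}, L50 = {l50}")
      ```

      Read this way, a scaffold L50 of 12 means that the 12 longest scaffolds together cover at least half of the assembly length.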

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.105), and has published the reviews under the same license as follows:

      Reviewer 1. Yanhong Zhang

      Are all data available and do they match the descriptions in the paper? No. There is no BioProject available for review at the link. Are the data and metadata consistent with relevant minimum information or reporting standards?

      No. "the GigaDB repository:DOI:XXXXX." I am not sure that the authors have uploaded the data.

      Is the data acquisition clear, complete and methodologically sound? No. I am not sure that the authors have uploaded the data.

      Is there sufficient data validation and statistical analyses of data quality? No. I need more information.

      Is the validation suitable for this type of data? No. I need more information.

      Is there sufficient information for others to reuse this dataset or integrate it with other data? No. I need more information.

      Any Additional Overall Comments to the Author:

      In line 41, you mean “50-100 kya”?

      The authors need to provide more details about the genomic data: Genome size estimation based on the k-mer spectrum? Statistics of genomic characteristics from k-mers? Statistics of the Hi-C sequencing raw data, such as raw bases and clean bases. Statistics of the pseudochromosome assemblies using Hi-C data. The result of the BUSCO assessment: how about complete BUSCOs? Complete single-copy? Statistics of gene predictions in the snake pipefish. Statistics of the noncoding RNA in the snake pipefish genome. The authors claim that all other data, including the repeat and gene annotation, were uploaded to the GigaDB repository: DOI: XXXXX. I cannot find these data. “DOI: XXXXX”? What does that mean?

      Reviewer 2. Sarah Flanagan

      Are all data available and do they match the descriptions in the paper?

      No. I received an NCBI link which took me to the raw data files and a BioSample description, but it did not link to the assembled and annotated genome.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes. Only one point was not clear to me in the methods -- please clarify in the text which data was used to generate consensus genome sequences using vcfutils (lines 240-241). How did this differ from the assembled and annotated genome?

      Any Additional Overall Comments to the Author:

      In the abstract and introduction, the description of the habitat of the species is confusing and it was not clear from the manuscript as written that there are two ecotypes, one that is pelagic and one that is coastal. Consider re-phrasing these sections (lines 31-32, 57-59, and 61-62) to better describe the habitat of this species.

      Please also consider increasing the font size of the labels in Figure 1 -- the details are very difficult to read.

    1. Editors Assessment: Understanding the distribution of Anopheles mosquito species is essential for planning and implementing malaria control programmes, a task undertaken in this study that assesses the composition and distribution of the Anopheles in different districts of Kinshasa in the Democratic Republic of Congo. Mosquitoes were collected using CDC light traps, and then identified by morphological and molecular means. In total 3,839 Anopheles were collected, and data was digitised, validated and shared via the GBIF database under a CC0 waiver. The project monitored the monthly dynamics of four species of Anopheles, showing a fluctuation in their respective frequencies during the study period. Review improved the metadata by adding more accurate date information, and this data can provide important information for further basic and advanced studies on the ecology and phenology of these vectors in West Africa.

      *This evaluation refers to version 1 of the preprint

    2. Abstract Understanding the distribution of Anopheles species in a region is an important task in the planning and implementation of malaria control programmes. This study was proposed to evaluate the composition and distribution of cryptic species of the main malaria vector, Anopheles gambiae complex, circulating in different districts of Kinshasa. To study the distribution of members of the An. gambiae complex, Anopheles were sampled by CDC light trap and larva collection across the four districts of Kinshasa city between July 2021 and June 2022. After morphological identification, an equal proportion of Anopheles gambiae s.l. sampled per site were subjected to polymerase chain reaction (PCR) for identification of cryptic An. gambiae complex species. The Anopheles gambiae complex was widely identified in all sites across the city of Kinshasa, with a significant difference in mean density, captured by CDC light, inside and outside households in Kinshasa (p=0.002). Two species of this complex circulate in Kinshasa: Anopheles gambiae (82.1%) and Anopheles coluzzii (17.9%). In all study sites, Anopheles gambiae was the most prevalent species. Anopheles coluzzii was very prevalent in Tshangu district. No hybrids (Anopheles coluzzii/Anopheles gambiae) were identified. Two cryptic species of the Anopheles gambiae complex circulate in Kinshasa. Anopheles gambiae s.s., present in all districts and Anopheles coluzzii, with a limited distribution. Studies on the ecology of the larval sites are essential to better understand the factors influencing the distribution of members of the An. gambiae complex in this megalopolis.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.104), and has published the reviews under the same license. This is part of the GigaByte Vectors of Human Disease series, and this and the other papers are hosted here. https://doi.org/10.46471/GIGABYTE_SERIES_0002

      The peer reviews are as follows.

      Reviewer 1. Paul Taconet

      Are all data available and do they match the descriptions in the paper? No

      Additional Comments

      1/ The CDC light trap catch data are available in the GBIF release, but the larva collection data are not included in the release. These larva collection data should either be included in the GBIF release, or it should be made clear in the manuscript that these data are not published.

      2/ In the dataset, the data are indicated to be reported at the species level (taxonRank = Species) but there are no An. coluzzii reported. However, in Table 3 of the manuscript, some An. coluzzii are reported. This is inconsistent. My guess is that the data reported in the dataset are those from the morphological identification, hence for An. gambiae at the COMPLEX level, and not the species. This should in any case be clarified and corrected: are the data in the dataset provided at the complex or at the species level? If complex, the ScientificName and taxonRank columns should be corrected. In addition, in the dataset, you could add an "identificationRemarks" column providing the source of identification (morphological or molecular).

      3/ In the dataset, for the species scientific name, I suggest using the names as presented in: Harbach, R.E. 2013. Mosquito Taxonomic Inventory, https://mosquito-taxonomic-inventory.myspecies.info/ . Or at least, providing the "nameAccordingTo" column.

      4/ The data available are of type 'occurrence' (only in 1 file, the "occurrence" file). For a better presentation of the data and in order to be in line with the GBIF data architecture, I would suggest transforming them into "sampling event" data (consisting of 1 'event core' file, 1 'occurrence' file, and potentially extension files), which is more suited to this kind of data acquired from sampling events (see https://ipt.gbif.org/manual/en/ipt/latest/sampling-event-data) and containing external measurements (e.g. temperature, see next point). This would enable the user to quickly understand the dates and locations of the sampling events.

      5/ Temperature and humidity are included in the main 'occurrence' file (column "dynamicProperties"):
      - What do these data correspond to (mean during the night of collection?), and how were these data collected (instrument, etc.)? This information is not provided in the manuscript.
      - Instead of putting these data in the "occurrence" file, I would suggest adding a "measurement" file in the GBIF data release, containing these meteorological data. Doing so would make it possible to include metadata about these measurements (instrument, etc.). See e.g. https://www.gbif.org/sites/default/files/gbif_IPT-sample-data-primer_en.pdf page 6.

      6/ In the dataset, for some of the collected mosquitoes, you put "organismRemarks" = "unfed". How did you collect this information? I could not see any mention of this feeding identification, neither in the manuscript nor in the dataset.

      7/ In the dataset, in the column "SamplingProtocol", there are spelling errors: "CDC ligth trap cathes" should be corrected to "CDC light trap catches".
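
      The restructuring suggested in comments 4/ and 5/ can be sketched as follows. This is only an illustration under stated assumptions: the record values, the identifiers, the units, and the idea that "dynamicProperties" holds a JSON string are all hypothetical, and only the Darwin Core term names (eventID, eventDate, samplingProtocol, occurrenceID, scientificName, taxonRank, measurementType, measurementValue, measurementUnit) are standard.

      ```python
      import json

      # Hypothetical flat occurrence records, loosely modelled on the review's description.
      flat_records = [
          {"occurrenceID": "occ-1", "eventID": "evt-1", "eventDate": "2021-07-15",
           "samplingProtocol": "CDC light trap catches",
           "scientificName": "Anopheles gambiae", "taxonRank": "species",
           "dynamicProperties": '{"temperature": 27.5, "humidity": 80}'},
          {"occurrenceID": "occ-2", "eventID": "evt-1", "eventDate": "2021-07-15",
           "samplingProtocol": "CDC light trap catches",
           "scientificName": "Anopheles gambiae", "taxonRank": "species",
           "dynamicProperties": '{"temperature": 27.5, "humidity": 80}'},
      ]

      EVENT_FIELDS = ("eventID", "eventDate", "samplingProtocol")
      OCCURRENCE_FIELDS = ("occurrenceID", "eventID", "scientificName", "taxonRank")
      ASSUMED_UNITS = {"temperature": "degrees Celsius", "humidity": "%"}  # assumption


      def split_records(records):
          """Split flat records into event core, occurrence and measurement tables."""
          events, occurrences, measurements = {}, [], []
          for record in records:
              event_id = record["eventID"]
              if event_id not in events:
                  # One event row per sampling event, not per mosquito.
                  events[event_id] = {field: record[field] for field in EVENT_FIELDS}
                  # Weather readings describe the event, so they are stored once per event.
                  properties = json.loads(record.get("dynamicProperties") or "{}")
                  for name, value in properties.items():
                      measurements.append({
                          "eventID": event_id,
                          "measurementType": name,
                          "measurementValue": value,
                          "measurementUnit": ASSUMED_UNITS.get(name, ""),
                      })
              occurrences.append({field: record[field] for field in OCCURRENCE_FIELDS})
          return list(events.values()), occurrences, measurements


      event_rows, occurrence_rows, measurement_rows = split_records(flat_records)
      print(len(event_rows), len(occurrence_rows), len(measurement_rows))
      ```

      Keeping the measurements in their own table leaves room for a measurementMethod column describing the instrument, which is the metadata the reviewer asks for.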

      Are the data and metadata consistent with relevant minimum information or reporting standards? No. See comments above.

      Is the data acquisition clear, complete and methodologically sound? No. See comments above

      Any Additional Overall Comments to the Author: Thanks for this nice work and the effort you put to open your data. See comments below and above to improve the work. 1/ comments for figure 1 (map) : the background layer is not very appropriate, as we miss landscape context. Maybe better to put an Open Street Map background layer, or a satellite image.

      Reviewer 2. Chris Hunter.

      Are all data available and do they match the descriptions in the paper?

      No. The larva data are not included in the GBIF dataset. Some of the descriptions of the data in the manuscript do not match the data available from GBIF. Any Additional Overall Comments to the Author:

      Major comments (Author action required): 1 - The manuscript describes larva collection and molecular identification of those species, but I cannot see any indication that those data are included in the GBIF dataset. Please clarify whether they are included or not, and if not please add them. 2 - The numbers cited in Table 1 do not match those shown in the GBIF dataset, e.g. the total of indoor/outdoor sampling events quoted in MS table 1 = 2180 / 1659, whereas in GBIF dataset there are 2304 indoor and 1535 outdoor sites listed? Please check your calculations and/or the data submitted to GBIF.
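
      The mismatch described in major comment 2 can be checked with simple arithmetic. The sketch below uses only the four counts quoted in the comment; the variable and function names are illustrative.

      ```python
      # Counts quoted in the reviewer's comment.
      manuscript_table1 = {"indoor": 2180, "outdoor": 1659}
      gbif_dataset = {"indoor": 2304, "outdoor": 1535}


      def count_differences(reported, observed):
          """Return observed minus reported for each category."""
          return {key: observed[key] - reported[key] for key in reported}


      totals = (sum(manuscript_table1.values()), sum(gbif_dataset.values()))
      differences = count_differences(manuscript_table1, gbif_dataset)
      print("totals:", totals)
      print("differences:", differences)
      ```

      Both splits add up to the same total, which suggests that the discrepancy may lie in how records were assigned to indoor versus outdoor rather than in missing or extra records.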

      Minor comments (Author action suggested): 1 - There are 59 events in the GBIF data that do not have a date. Please check those data and update if you have those dates available. 2 - The events are all included in the GBIF sampling event dataset, however “individualCount” data are not included, please explain why those counts are not included as observation dataset(s)? i.e. why is there no number of individual mosquitos included in the dataset? 3 - The full DwC-GBIF dataset does include an indication of the indoor/outdoor location of the sampling sites in the "eventRemark" column, but if you are making updates to the dataset may I suggest using the column heading “habitat” to include that information in GBIF either instead or as well. 4 - Ideally, the molecular identification data should be shared. I don’t have access to the “protocol of Scott [29]” but my assumption is that the PCR products are differentiated by size via running on a gel? If so, and you have the digital images of those gels please let the GigaByte editors know and they will help you share them via the GigaDB database.

      Please see the linked file "Data-Review-of-DRR-202310-03.pdf" for more details about the above concerns.

      https://gigabyte-review.rivervalleytechnologies.com/journal/gx/download-files?YXJ0aWNsZT00ODEmZmlsZT0xODMmdHlwZT1nZW5lcmljJnZpZXc9dHJ1ZQ~~

    1. Background Applying good data management and FAIR data principles (Findable, Accessible, Interoperable, and Reusable) in research projects can help disentangle knowledge discovery, study result reproducibility, and data reuse in future studies. Based on the concepts of the original FAIR principles for research data, FAIR principles for research software were recently proposed. FAIR Digital Objects enable discovery and reuse of Research Objects, including computational workflows for both humans and machines. Practical examples can help promote the adoption of FAIR practices for computational workflows in the research community. We developed a multi-omics data analysis workflow implementing FAIR practices to share it as a FAIR Digital Object. Findings We conducted a case study investigating shared patterns between multi-omics data and childhood externalizing behavior. The analysis workflow was implemented as a modular pipeline in the workflow manager Nextflow, including containers with software dependencies. We adhered to software development practices like version control, documentation, and licensing. Finally, the workflow was described with rich semantic metadata, packaged as a Research Object Crate, and shared via WorkflowHub. Conclusions Along with the packaged multi-omics data analysis workflow, we share our experiences adopting various FAIR practices and creating a FAIR Digital Object. We hope our experiences can help other researchers who develop omics data analysis workflows to turn FAIR principles into practice.

      Reviewer 3 Megan Hagenauer - Original Submission

      Review of "A Multi-omics Data Analysis Workflow Packaged as a FAIR Digital Object" by Niehues et al. for GigaScience

      08-31-2023

      I want to begin by apologizing for the tardiness of this review - my whole family caught Covid during the review period, and it has taken several weeks for us to be functional again.

      Overview

      As a genomics data analyst, I found this manuscript to be a fascinating, inspiring, and, quite honestly, intimidating, view into the process of making analysis code and workflow truly meet FAIR standards. I have added recommendations below for elements to add to the manuscript that would help myself and other analysts use your case study to plan out our own workflows and code release. These recommendations fall quite solidly into the "Minor Revision" category and may require some editorial oversight as this article type is new to me. Please note that I only had access to the main text of the manuscript while writing this review.

      Specific Comments

      1) As a case study, it would be useful to have more explicit discussion of the expertise and effort involved in the FAIR code release and the anticipated cost/benefit ratio:

      As a data analyst, I have a deep, vested interest in reproducible science and improved workflow/code reusability, but also a limited bandwidth. For me, your overview of the process of producing a FAIR code release was both inspiring and daunting, and left me with many questions about the feasibility of following in your footsteps. The value of your case study would be greatly enhanced by discussing cost/benefit in more detail:

        a. What sort of expertise or training was required to complete each step in the FAIR release? E.g.,

          i. Was your use of tools like Github, Jupyter notebook, WorkflowHub, and DockerHub something that could be completed by a scientist with introductory training in these tools, or did it require higher level use?

          ii. Was there any particular training required for the production of high quality user documentation or metadata? (e.g., navigating ontologies?)

        b. With this expertise/training in place, how much time and effort do you estimate that it took to complete each step of adapting your analysis workflow and code release to meet FAIR standards?

          i. Do you think this time and effort would differ if an analyst planned to meet FAIR standards for analysis code prior to initiating the analysis versus deciding post-hoc to make the release of previously created code fit FAIR standards?

        c. The introduction provides an excellent overview of the potential benefits of releasing FAIR analysis code/workflows. How did these benefits end up playing out within your specific case study?

          i. e.g., I thought this sentence in your discussion was a particularly important note about the benefits of FAIR analysis code in your study: "Developing workflows with partners across multiple institutions can pose a challenge and we experienced that a secure shared computing environment was key to the success of this project."

          ii. Has the FAIR analysis workflow also been useful for collaboration or training in your lab?

          iii. How many of the analysis modules (or other aspects of the pipeline) do you plan on reusing? In general, what do you think is the size for the audience for reuse of the FAIR code? (e.g., how many people do you think will have been saved significant amounts of work by you putting in this effort?)

          iv. … Or is the primary benefit mostly just improving the transparency/reproducibility of your science?

        d. If there is any way to easily overview these aspects of your case study (effort/time, expertise, immediate benefits) in a table or figure, that would be ideal. This is definitely the content that I would be skimming your paper to find.

      2) As a reusable code workflow, it would be useful to provide additional information about the data input and experimental design, so that readers can determine how easily the workflow could be adapted to their own datasets. This information could be added to the text or to Fig 1. E.g.,

          i. The dimensionality of the input (sample size, number of independent variables & potential co-variates, number of dependent variables in each dataset, etc)

          ii. Data types for the independent variables, co-variates, and dependent variables (e.g., categorical, numeric, etc)

          iii. Any collinearity between independent variables (e.g., nesting, confounding).

      3) As documentation of the analysis, it would be useful to provide additional information about how the analysis workflow may influence the interpretation of the results.

        a. It would be especially useful to know which aspects of the analysis were preplanned or following a standard procedure/protocol, and which aspects of the analysis were customized after reviewing the data or results. This information can help the reader assess the risk of overfitting or HARKing.

        b. It would also be useful to call out explicitly how certain analysis decisions change the interpretation of the results. In particular, the decision to use dimension reduction techniques within the analysis of both the independent and dependent variables, and then focus only on the top dimensions explaining the largest sources of variation within the datasets, is especially important to justify and describe its impact on the interpretation of the results. Is there reason to believe that externalizing behavior should be related to the largest sources of variation within buccal DNA methylation or urinary metabolites? Within genetic analyses, the assumption tends to be the opposite - that genetic variation related to behavior (such as externalizing) is likely to be present in a small percent of the genome, and that the top sources of variation within the genetics dataset are uninteresting (related to population) and therefore traditionally filtered out of the data prior to analysis. Within transcriptomics, if a tissue is involved in generating the behavior, some of the top dimensions explaining the largest sources of variation in the dataset may be related to that behavior, but the absolute largest sources of variation are almost always technical artifacts (e.g., processing batches, dissection batches) or impactful sources of biological noise (e.g., age, sex, cell type heterogeneity in the tissue). Is there reason to believe that cheek cells would have their main sources of epigenetic variation strongly related to externalizing behavior? (maybe as a canary in a coal mine for other whole organism events like developmental stress exposure?). Is there reason to believe that the primary variation in urinary metabolites would be related to externalizing behavior? (perhaps as a stand-in for other largescale organismal states that might be related to the behavior - hormonal states? metabolic states? inflammation?). Since the goal of this paper is to provide a case study for creating a FAIR data analysis workflow, it is less important that you have strong answers for these questions, and more important that you are transparent about how the answers to these questions change the interpretation of your results. Adding a few sentences to the discussion is probably sufficient to serve this purpose.

      Thank you for your hard work helping advance our field towards greater transparency and reproducibility. I look forward to seeing your paper published so that I can share it with the other analysts in our lab.
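
      The concern raised in point 3b, that the largest source of variation in an omics dataset may be technical rather than related to the phenotype, can be illustrated with a small simulation. This is a sketch on synthetic data only: the effect sizes, sample size and variable names are assumptions, the first principal component is obtained by plain power iteration, and nothing here is taken from the authors' workflow.

      ```python
      import random


      def pearson(x, y):
          """Pearson correlation of two equal-length sequences."""
          n = len(x)
          mean_x, mean_y = sum(x) / n, sum(y) / n
          cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
          var_x = sum((a - mean_x) ** 2 for a in x)
          var_y = sum((b - mean_y) ** 2 for b in y)
          return cov / (var_x * var_y) ** 0.5


      def first_pc_scores(rows, iterations=50):
          """Scores on the first principal component, via power iteration."""
          n, p = len(rows), len(rows[0])
          means = [sum(row[j] for row in rows) / n for j in range(p)]
          centered = [[row[j] - means[j] for j in range(p)] for row in rows]
          rng = random.Random(1)
          weights = [rng.gauss(0, 1) for _ in range(p)]
          for _ in range(iterations):
              # Multiply by X^T X, then renormalise the loading vector.
              scores = [sum(x * w for x, w in zip(row, weights)) for row in centered]
              weights = [sum(s * row[j] for s, row in zip(scores, centered))
                         for j in range(p)]
              norm = sum(w * w for w in weights) ** 0.5
              weights = [w / norm for w in weights]
          return [sum(x * w for x, w in zip(row, weights)) for row in centered]


      def simulate(n=100, p=20, seed=0):
          """Synthetic data: a strong batch shift on every feature and a weak
          phenotype effect on two features (all values are assumptions)."""
          rng = random.Random(seed)
          batch = [i % 2 for i in range(n)]
          phenotype = [rng.gauss(0, 1) for _ in range(n)]
          rows = []
          for i in range(n):
              row = [rng.gauss(0, 1) + 3.0 * batch[i] for _ in range(p)]
              row[0] += 0.5 * phenotype[i]
              row[1] += 0.5 * phenotype[i]
              rows.append(row)
          return rows, batch, phenotype


      rows, batch, phenotype = simulate()
      pc1 = first_pc_scores(rows)
      print("|corr(PC1, batch)|     =", round(abs(pearson(pc1, batch)), 3))
      print("|corr(PC1, phenotype)| =", round(abs(pearson(pc1, phenotype)), 3))
      ```

      In this toy setting the first component tracks the batch variable far more closely than the phenotype, which is the kind of check the reviewer's questions point towards before interpreting top components.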


      Reviewer 2 Dominique Batista - Original Submission

      Very good paper on the FAIR side. You detail what were the challenges, in particular when it comes to the selection of ontologies and terms. It is unclear if the generation of the ISA metadata is included in the workflow. Can a user generate the metadata for the synthetic dataset or their own data using the workflow? Adding a GitHub action running the workflow with the synthetic data would help reusability but is not required for the publication of the paper.


      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad115), and has published the reviews under the same license. These are as follows.

      Reviewer 1 Carole Goble - Original Submission

      This work reports a multi-omics data analysis workflow packaged as a RO-Crate, an implementation of a FAIR Digital Object.

      We limit our comments to the technical aspects of the Research Object and workflow packaging. The scientific validity of the omics analysis itself is outside our expertise.

      The paper is comprehensive and the background grounding in the current state of the art is excellent and thorough. The paper is an excellent exemplar of the future of data analysis reporting for FAIR and reproducible computational methods, and the amount of work impressive. We congratulate the authors.

      WorkflowHub entry https://workflowhub.eu/workflows/402?version=5# gives a comprehensive report of the Nextflow workflow and its multiple versions, all the files including the R scripts and the synthetic data. The RO-Crate rendering looks correct and version-locking the R containers is following best practice (https://github.com/Xomics/ACTIONdemonstrator_workflow/blob/main/nextflow.config#L44).

      The paper also highlights the amount of work needed to make such a pipeline both metadata machine processable and metadata human readable.

      To make this pipeline reproducible requires a mixture of notebooks submitted as supplementary materials, the Nextflow workflow with its R scripts represented as an RO-Crate in WorkflowHub and a README is linked to the container recipes in https://github.com/Xomics/Docker_containers and then another Documentation.md file. There seems to be the potential for duplicated effort in reporting the necessary metadata describing the workflow that could be highlighted in the Discussion as a steer to the digital object community.

      - Could the RO-Crate approach be widened beyond the current Workflow RO-Crate, and would there be value in streamlining the metadata, or is this just an artefact of the need for multiple descriptions and ease of publishing? If the JSON within the RO-Crate was more richly annotated, could some of the Documentation.md be avoided altogether, and is that even desirable?

      - The README includes the container/software packaging and is not linked from the RO-Crate (and there isn't an obvious property to link to it yet). Could these be RO-Crates too?

      - The notebooks in the supplementary files could also be registered in WorkflowHub and linked to the Nextflow workflow (see https://workflowhub.eu/workflows?filter%5Bworkflow_type%5D=jupyter).

      - Is it feasible and desirable to have a single RO-Crate linked to many other RO-Crates to represent the whole reproducible pipeline in full?

      In the discussion the FAIR principles verification through different practices and approaches would be more helpful if it was more precise. Comments seem to be limited to the Workflow RO-Crate and use of ontologies for machine readability. As highlighted in table 1 there is more to FAIR software & workflows than this.

      Minor remarks

      Key points

      - We here demonstrate the implementation multiomics data -> We here demonstrate an implementation of a multi-omics data.

      Background

      - The documentation of dependencies is highlighted as a prerequisite for software interoperability. In the FAIR4RS principles I2 also highlights qualified references to other objects - presumably other software or installation requirements. This highlights the relationship between software interoperability and software portability. It seems that dependencies more relate to portability rather than interoperability.

      - "Based on the FDO concept, the RO-Crate approach was specified". This is a confusing statement. RO-Crates have been recognised as an implementation approach for the FDO concept as proposed by the FDO Forum. For more discussion on FDO and the Linked Data approach promoted by RO-Crates see https://arxiv.org/abs/2306.07436. However, RO-Crates are not based in the FDO - they are based on the Research Object packaging work that emerged from the EU Wf4ever project, (see https://doi.org/10.1016/j.future.2011.08.004 from 2013).

      - It is better to describe the RO-Crate metadata file as "It contains all contextual and non-contextual related data to re-run the workflow". Instead of "It can additionally contain data on which the workflow can be run."

      Workflow Implementation

      - At the beginning of the last paragraph, "Besides the workflow and the synthetic data set" replace with "As well as the workflow and the synthetic data set".

      - https://workflowhub.eu/workflows/402?version=5# gives a very nice pictorial overview of the workflow that you may consider including in the paper itself.
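
      As a rough illustration of the reviewer's questions about richer crate metadata, the sketch below builds a minimal RO-Crate 1.1 style metadata document in which documentation files are listed as parts of the root dataset alongside the workflow. It is one possible arrangement under stated assumptions, not the authors' actual crate: the file names "main.nf" and "README.md" and the helper function are hypothetical, and listing documentation under hasPart is only one generic way of referencing it.

      ```python
      import json


      def build_crate_metadata(workflow_file, extra_files):
          """Return a minimal RO-Crate 1.1 style JSON-LD document as a dict."""
          descriptor = {
              "@id": "ro-crate-metadata.json",
              "@type": "CreativeWork",
              "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
              "about": {"@id": "./"},
          }
          root = {
              "@id": "./",
              "@type": "Dataset",
              "mainEntity": {"@id": workflow_file},
              # Every packaged file is referenced from the root dataset.
              "hasPart": [{"@id": workflow_file}] + [{"@id": f} for f in extra_files],
          }
          workflow = {
              "@id": workflow_file,
              "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
              "name": "Example workflow (hypothetical)",
          }
          files = [{"@id": f, "@type": "File"} for f in extra_files]
          return {
              "@context": "https://w3id.org/ro/crate/1.1/context",
              "@graph": [descriptor, root, workflow] + files,
          }


      crate = build_crate_metadata("main.nf", ["README.md", "Documentation.md"])
      print(json.dumps(crate, indent=2))
      ```

      Whether a single crate like this should in turn reference other crates (for containers or notebooks) is exactly the open design question the reviewer raises; the sketch does not attempt to answer it.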

    1. Background: Machine learning (ML) has emerged as a vital asset for researchers to analyze and extract valuable information from complex datasets. However, developing an effective and robust ML pipeline can present a real challenge, demanding considerable time and effort, thereby impeding research progress. Existing tools in this landscape require a profound understanding of ML principles and programming skills. Furthermore, users are required to engage in the comprehensive configuration of their ML pipeline to obtain optimal performance. Results: To address these challenges, we have developed a novel tool called Machine Learning Made Easy (MLme) that streamlines the use of ML in research, specifically focusing on classification problems at present. By integrating four essential functionalities, namely Data Exploration, AutoML, CustomML, and Visualization, MLme fulfills the diverse requirements of researchers while eliminating the need for extensive coding efforts. To demonstrate the applicability of MLme, we conducted rigorous testing on six distinct datasets, each presenting unique characteristics and challenges. Our results consistently showed promising performance across different datasets, reaffirming the versatility and effectiveness of the tool. Additionally, by utilizing MLme’s feature selection functionality, we successfully identified significant markers for CD8+ naive (BACH2), CD16+ (CD16), and CD14+ (VCAN) cell populations. Conclusion: MLme serves as a valuable resource for leveraging machine learning (ML) to facilitate insightful data analysis and enhance research outcomes, while alleviating concerns related to complex coding scripts. The source code and a detailed tutorial for MLme are available at https://github.com/FunctionalUrology/MLme.

      Reviewer 2 Ryan J. Urbanowicz Revision 2

      At this point I earnestly wish to see this paper published and, acknowledging my own potential bias as a developer of STREAMLINE and participant in the development of TPOT, I am still recommending minor revision. At minimum, for me to recommend acceptance of this paper the following small but critical issue needs to be addressed; otherwise I must recommend rejection. I believe this concern is well justified by scientific standards. I also still strongly recommend that the authors reconsider the other non-critical issues reiterated below as a way to make their paper stronger and better received by the scientific community. If the journal editor disagrees with my assessment, I would still be happy to see this work published; however, I must stand by my assertions below.

      Critical Issue:

      Limitations section: The authors updated the text "excels in its core objective of addressing classification tasks" to "it excels in its primary objective of addressing pipeline development for classification tasks". The use of the word 'excels' is the key problem, as this word is defined as "to do or be better than others". While the change in phrasing correctly no longer implies that MLme performed better than the other evaluated AutoML tools, it does still imply that it is the best at developing a pipeline for classification tasks, but no specific evidence is provided in the paper to support this assertion. That is, there were no studies comparing how easy the tool was for users to apply relative to other AutoML tools, and no detailed comparison of which pipeline elements could be included by MLme versus other AutoML or pipeline development tools. The fact that MLme doesn't include hyperparameter optimization is in itself a limitation that I think would prevent MLme from being claimed as excelling or superior in pipeline development to other tools/platforms, even if it's easier to use than other tools. As phrased in the reviewer response, the authors could say that MLme is well-equipped to handle pipeline development, as this would be a fair statement. Altogether, I'd strongly encourage the authors not to make statements about the superior aspects of MLme without clearly backing up these statements with direct comparisons. Instead I'd suggest highlighting elements of MLme that are 'unique' or provide more functionality in contrast with other tools. In the reviewer response the authors make the claim that MLme is superior in terms of ease of use for visualization and exploratory analysis. If they want to make that statement in the paper, backed up by accurate comparisons to other tools, I'd agree with that addition.

      Non-Critical Issues that I feel still should be addressed:

      1. Table S1 has been updated to remove the inaccuracies I previously pointed out; however, this alone does not change the broader concern I had regarding the intention of this table (which is to highlight the parts of MLme that appear better than other AutoML tools without fairly pointing out the limitations of MLme in contrast with other tools). As this is a supplemental materials table, I do not feel this is critical, but I think making a table that more fairly reflects the strengths and limitations of different tools would greatly strengthen this paper.

      2. The pipeline designs in Figure 2 and Figure S10 are both high-level and still do not provide enough detail or clarity to understand exactly what happens, and in what order, when applying the AutoML element of MLme. The key words here are transparency and reproducibility. The supplemental materials could describe a detailed walk-through of what the AutoML does at each step. At minimum this could also be clearly addressed in the software documentation on GitHub.

      3. While I understand the need for brevity, I think the addition of a sentence that indicates specifically which AutoML tools are most similar to MLme is a reasonable request that better places MLme in the context of the greater AutoML research space.


      **Reviewer 2 Ryan J. Urbanowicz** Revision 1

      Overall I think the authors have made some good improvements to this paper, although it does not seem like the main body of the paper has changed much, with most of the updates going into the supplemental materials. However, I think this work is worthy of publication once the following items are addressed (I still feel strongly that they should be addressed, but doing so should be fairly easy).

      1. Limitations section: While the authors added some basic comparisons to a few other AutoML tools, I do not see how they are justified in saying that MLme 'excels' in its core objective of addressing classification tasks. This implies that it performs classification better than other methods, which is not at all backed up here, and indeed would be very difficult to prove, as it would require a huge amount of analyses over a broad range of simulated and real-world benchmark datasets, in comparison to many or all other AutoML tools. At best I think the authors can say here that it is at least comparable in performance to AutoML tools (X, Y, Z) in its ability to conduct classification analyses. According to Figure S9, this is only across 7 datasets and focused only on the F1 score, which could also be misleading or cherry-picked. At best I believe the authors can say in the paper that "Initial evaluation across 7 datasets suggested that MLme performed comparably to TPOT and Hyperopt-sklearn with respect to F1 score performance. This suggests that MLme is effective as an automated ML tool for classification tasks." (or something similar).

      2. While the authors lengthened the supplemental materials table comparing ML algorithms (mainly by adding some other AutoML tools), this table is intentionally presenting the capabilities of tools in a way that makes it appear that MLme does the most (with the exception of the 'features' column). For example, what about a column to indicate whether an AutoML tool has an automated pipeline discovery component (like TPOT)? In terms of AutoML, this table is structured to highlight the benefits of MLme, rather than give a fair comparison of AutoML tools (which is my major concern here). In terms of AutoML performance and usability there is a lot more to these different tools than the 6 columns presented. In this table 'features' seems like an afterthought, but is arguably the most important aspect of an AutoML tool.

      3. Additionally, the information presented in the AutoML comparison table does not seem to be entirely accurate, or at least how the columns are defined is not made entirely clear. STREAMLINE can be run by users with no coding experience (as a Google Colab notebook), so it has a code-free option (just not a GUI). STREAMLINE also generates more than two exploratory analysis plots, and more results visualization plots than indicated. While I agree that MLme has many more ease-of-use functionalities in comparison to STREAMLINE (which is a very nice plus), a reader might look at this table and think they need to know how to code in order to use STREAMLINE, which is not the case. Could the authors at least define their criteria for the "code free" column? As it's presented now, it seems to use exactly the same criteria as the GUI column (in which case it is redundant). The same is true for the legend of the table, where '*' indicates that coding experience is required for designing a custom pipeline. This requires more clarification, as STREAMLINE can be customized easily without coding experience by simply changing options in the Google Colab notebook, and TPOT automatically discovers new analysis pipelines, which isn't reflected at all.

      4. While I appreciate the authors adding a citation for STREAMLINE and some other AutoML tools not previously cited, it would be nice for the authors to discuss other AutoML tools further in their main paper, as well as to acknowledge in the main paper which AutoML tools are most similar to MLme in overall design and capabilities. Based on my own review of AutoML tools, the most similar tools would include STREAMLINE and MLJAR-supervised.

      5. I like the addition of Figure S10, which more clearly lays out the elements included in MLme, but I still think the paper and documentation lack a clear and transparent walk-through of exactly what happens to the data and how the analyses are conducted from start to finish when using the AutoML (at least by default). This is important for trusting what happens under the hood when reporting results, etc.

      Other comments responding to author responses:

      * I still disagree with the authors that a dataset with up to 1500 samples or up to 5520 features could be considered large by today's standards across all research domains. Even within biomedical data, datasets of up to 100K subjects are becoming common, and 'omics' datasets regularly reach hundreds of thousands to multiple millions of features. I am glad to see the authors adding a larger dataset, but I would still be cautious when making suggestions about how well MLme handles 'large' datasets without including specifics for context. However, ultimately this is subjective, and it is not preventing me from endorsing publication.
      * I also disagree that MLme isn't introducing a new methodology. The steps comprising an AutoML tool can be considered a new methodology in themselves, even if it is built on established components, because there are still innumerable ways to put a machine learning analysis pipeline together that add bias or data leakage, or that simply yield poorer performance. Thus I also don't think it's fair to just 'assume' your method will work as well as other AutoML tools, especially when you've run it on a limited number of datasets/problems.


      Reviewer 1 Joe Greener Revision 1

      The authors have adequately addressed my concerns and I believe that the manuscript is ready for publication.


      **Reviewer 2 Ryan J. Urbanowicz** Original Submission

      In this paper the authors introduce MLme, a comprehensive toolkit for machine-learning-driven analysis. The authors discuss the benefits and limitations of their toolkit and provide a demonstration evaluation on 6 datasets suggesting its potential value. Overall MLme seems like a nice, easy-to-use tool with a good deal of potential and value. However, as the developer of STREAMLINE, an AutoML toolkit with a number of very similar goals and a similar design architecture to MLme, I was very surprised that it was not referenced or compared to in this paper. My major concerns involve the limited details about what MLme specifically does and includes (e.g. which 16 ML algorithms are built in), as well as what seems like a limited and largely biased comparison of this toolkit's capabilities to other AutoML tools (most specifically STREAMLINE, which has a significant degree of similarity).

      - There are many other AutoML tools out there that the authors have not considered in their Table S1 or referenced, e.g. MLBox, AutoWeka, H2O, Devol, Auto-Keras, TransmogrifAI, and, most glaringly for this reviewer, STREAMLINE (https://github.com/UrbsLab/STREAMLINE).
      - In particular, with respect to STREAMLINE (https://link.springer.com/chapter/10.1007/978-981-19-8460-0_9), there are a large number of pipeline similarities and a similar analysis mission and goals to MLme that make it extremely relevant to cite and contrast in this manuscript as well as in Table S1. STREAMLINE has a similar focus on the end-to-end ML analysis pipeline, including automated exploratory analysis, data processing, feature selection, ML modeling with 16 algorithms, evaluation, generation of results visualizations, interactive visualizations, pickled output storage, etc. The first STREAMLINE paper was published in March 2023, a preprint of that manuscript was published in June 2022, and a precursor implementation of this pipeline was published as a preprint in August 2020 (https://arxiv.org/abs/2008.12829). This is in contrast with MLme's preprint, published in July 2023. While MLme has a number of potentially nice features that STREAMLINE does not (i.e. a GUI, spider plots, easy color palette selection, inclusion of a dummy classifier, and the ability to handle multi-class classification [which is not yet available, but is in development for STREAMLINE along with regression]), it lacks other potentially important features that STREAMLINE does have (i.e. automated hyperparameter optimization, basic data cleaning and feature engineering [in the newest release], collective feature selection, pickled models for later reuse, collective feature importance visualizations, a PDF analysis summary report, the ability to quickly evaluate models on new replication data, and potentially other capabilities that I can't highlight because of limited details on what MLme includes). The absence of hyperparameter optimization is a particularly problematic omission from MLme, as this is a fairly critical element of any machine learning analysis pipeline.
      - Table S1 should be expanded to cover a broader range of toolkit features to better highlight the strengths and weaknesses of a greater variety of methodologies. The present table seems a bit cherry-picked to make MLme stand out as appearing to have more capabilities than other tools, but there are uncaptured advantages to these other approaches.
      - This manuscript includes no citations justifying its pipeline design choices. In particular, I'm most concerned with the authors' justification of automatically including data resampling by default, as it is well known that this can introduce bias in modeling. It's also not clear what determines whether data resampling is required, and whether this only impacts training data or also testing data.
      - It's not clear that resampling is a good or reliable strategy for an automated machine learning framework, since data resampling to create a more balanced dataset can also incorporate bias into an ML model.
      - In the context of potential datasets from different domains (including biomedical data), the datasets identified in this paper as being "large" have only up to 1500 samples and only up to 5520 features, which would not be considered large by most data scientists' standards.
      - There are limited details in this paper and in the software's GitHub documentation in terms of transparently indicating exactly what this pipeline does, and what options, algorithms, evaluation metrics, and visualizations it includes.
      - Since the authors do not benchmark MLme against any other AutoML tool and they have a very limited set of benchmarked datasets (6 total, with limited diversity of data types, sizes, and feature types), I don't think it's fair to claim that their methodology necessarily excels in its core objective of addressing classification tasks. Ideally the authors would conduct benchmarking comparisons to STREAMLINE, as well as other AutoML toolkits; however, this might also understandably be outside the scope of this current paper. I do suggest the authors be more conservative in what assertions they make and conclusions they draw with respect to MLme. The authors might consider using established ML or AutoML benchmark datasets used by other algorithms and frameworks to compare, or facilitate comparison of, their pipeline toolkit to others.
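The reviewer's question of whether resampling touches only the training data or also the testing data can be made concrete with a small sketch. The following is an illustration only, not MLme's or STREAMLINE's implementation: the helper names `random_oversample` and `split_then_oversample` are hypothetical, the toy data are invented, and plain random oversampling stands in for whatever resampling method a given pipeline actually uses. The point it shows is the ordering: the test set is held out first, and only the training portion is rebalanced.

```python
import random
from collections import Counter


def random_oversample(rows, labels, rng):
    """Duplicate rows of smaller classes at random until every class matches the largest."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def split_then_oversample(rows, labels, test_fraction=0.25, seed=0):
    """Hold out the test set first, then resample the training portion only."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_rows, train_labels = random_oversample(
        [rows[i] for i in train_idx], [labels[i] for i in train_idx], rng
    )
    test_rows = [rows[i] for i in test_idx]
    test_labels = [labels[i] for i in test_idx]
    return train_rows, train_labels, test_rows, test_labels


# Hypothetical imbalanced toy data: 90 negatives and 10 positives.
rows = list(range(100))
labels = [0] * 90 + [1] * 10
train_rows, train_labels, test_rows, test_labels = split_then_oversample(rows, labels)
print(Counter(train_labels))  # class counts in the rebalanced training set
print(Counter(test_labels))   # class counts in the untouched test set
```

Because duplicated rows are drawn only from the training indices, no test row can leak into the training set, and the test set keeps the class distribution it was sampled with.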


      This work has been published in the journal GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad111), and the reviews have been published under the same license. These are as follows.

      **Reviewer 1 Joe Greener** Original Submission

      Akshay et al. present MLme, a toolkit for exploring data and automatically running machine learning (ML) models. This software could be useful for those with less experience in ML. I believe it is suitable for publication provided the following points are addressed.

      Major:

      1. The performance of models is consistently over 90% but without a reference point it is unclear how good this is. Are there results from previous studies on the same data that can be compared to, with a table comparing accuracy with MLme to previous work? Otherwise it is unclear whether MLme is supposed to be a quick way to have a first go at prediction on the data or can entirely replace manual model refinement.

      2. With any automated ML system it is important to impress upon users the risks of ML. For example, the splitting of data into training and test sets is done randomly, but there are cases where this is not appropriate as it will lead to data leakage between the training and test sets. This could be mentioned in the manuscript and somewhere on the GUI. There isn't really a replacement for domain knowledge, and users of MLme should have this in mind when using the software.

      Minor:

      3. More experienced ML users may want to use the software to have a first go at prediction on the data. For these users it may be useful to provide access to commands or scripts, or at least information on which functions were used, as additional options in the GUI. Users could then run these scripts themselves to tweak hyperparameters etc.

      4. The visualisation tab lacks an info button by the file upload to say what the file format should be.
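The data leakage risk raised in major point 2 can be illustrated with a minimal sketch. It assumes a hypothetical scenario in which several samples come from the same patient, so a purely random split could place related samples on both sides; the function name `group_train_test_split` and the patient identifiers are invented for illustration and are not part of MLme.

```python
import random


def group_train_test_split(group_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) such that no group appears in both sets."""
    rng = random.Random(seed)
    unique_groups = sorted(set(group_ids))
    rng.shuffle(unique_groups)
    # Assign whole groups, not individual samples, to the test set.
    n_test_groups = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test_groups])
    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx


# Hypothetical example: 5 patients with 3 samples each.
group_ids = [f"patient_{p}" for p in range(5) for _ in range(3)]
train_idx, test_idx = group_train_test_split(group_ids)
train_groups = {group_ids[i] for i in train_idx}
test_groups = {group_ids[i] for i in test_idx}
print(sorted(test_groups))
```

Splitting at the level of groups keeps all samples from one patient on the same side of the split. Libraries such as scikit-learn provide group-aware splitters (for example `GroupShuffleSplit`) for the same purpose; choosing the right grouping variable still depends on domain knowledge, as the reviewer notes.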

    1. The rapidly growing collection of public single-cell sequencing data have become a valuable resource for molecular, cellular and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. In our scalability benchmarking experiments, Vulture can outperform the state-of-the-art cloud-based pipeline Cumulus with a 40% and 80% reduction of runtime and cost, respectively. Furthermore, Vulture is 2-10 times faster than PathogenTrack and Venus, while generating comparable results. We applied Vulture to two COVID-19, three hepatocellular carcinoma (HCC), and two gastric cancer human patient cohorts with public sequencing reads data from scRNA-seq experiments and discovered cell-type specific enrichment of SARS-CoV2, hepatitis B virus (HBV), and H. pylori positive cells, respectively. In the HCC analysis, all cohorts showed hepatocyte-only enrichment of HBV, with cell subtype-associated HBV enrichment based on inferred copy number variations. In summary, Vulture presents a scalable and economical framework to mine unknown host-microbial interactions from large-scale public scRNA-seq data. Vulture is available via an open-source license at https://github.com/holab-hku/Vulture.

      **Reviewer 1 Liuyang Zhao** R1 version

      The manuscript presents a useful microbiome tool, named "Vulture: Cloud-enabled scalable mining of microbial reads in public scRNA-seq data", using a large and valuable dataset. The study is important in deepening our understanding of the microbiome in public data. However, the authors' responses do not fully address my concerns, and some issues in the manuscript still need improvement. Here are the requirements for new software to be good enough for publication:

      1. Providing a Docker image is an improvement; however, installation via conda, the most widely used method, is still missing.

      2. Further microbial detection examples are missing. Could you provide an example that uses, for instance, the full NCBI RefSeq database, as in Kraken2, to screen for all microbes? This would be more useful.

      3. The authors have still not promoted their software on social media. If more people do not start using it, how can we know that it is useful? Reviewers have a lot of other work and do not have enough time to test this software. Promoting it on Twitter and WeChat would help improve the software.

      4. The software name should be unique, which makes it convenient to count the real users through all available resources (as with QIIME, ImageGP, and EasyAmplicon). The name Vulture is unacceptable, because it returns millions of hits in Google Scholar. A unique name should return no existing hits; otherwise it is hard to know the real number of users.

      5. The manuscript states that the source code supporting the generation of individual figures in this paper will be available in GigaDB after publication. Where can the reviewers check it now?


      **Reviewer 3 Liuyang Zhao** Original submission

      The authors aim to develop cloud-enabled approaches for detecting viral reads in public single-cell RNA sequencing (scRNA-seq) data. This study makes a significant contribution to the identification of viruses and bacteria in public scRNA-seq data. Although the outcomes are satisfactory, the novelty of the proposed methods is limited. To date, no evidence has been provided to demonstrate their superiority over recently published methods (such as PathogenTrack and Venus) when executed on a local machine. There are also several issues that need to be further addressed, as highlighted below:

      1. The documentation available on the GitHub pipeline does not explain how to utilize the latest virus database or how to incorporate a user's custom database. Virus databases are now updated very quickly, so it would be more appropriate for the authors to update the database promptly or to allow users to customize and create their own database.

      2. Figure 2a only has an overall comparison graph; it can be improved by adding detailed comparison graphs with Cumulus, PathogenTrack and Venus.

      3. Figure 2b is not persuasive enough. It would be better to compare several pipeline platforms with similar functionalities, or to compare some specific steps, such as the four steps in Figure 2a. In addition, since all of these comparisons involve software developed by other researchers, please provide a detailed description of why the authors' method is faster.

      4. Figure 3c could be created with microbial clustering and non-microbial clustering to highlight the impact of virus identification on classification results.

      5. Fig. S1: the label should read "Quality control on read level".


      ** Reviewer 2 Jingzhe Jiang ** Original submission

      In this study, Chen et al. introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. They further applied Vulture to COVID-19, HCC, and gastric cancer human patient cohorts with public sequencing read data and discovered cell-type specific enrichment of SARS-CoV-2, hepatitis B virus (HBV), and H. pylori positive cells. Generally speaking, this study is innovative, has good application potential, and can better assist single-cell research from the point of view of infection. I have only a few minor questions for the authors:
      1. Background: The first appearance of H. pylori should be replaced with its full name.
      2. Methods, Downstream analysis of scRNA-seq samples: Why were different tools (SCANPY/Seurat, BBKNN/Harmony) used to analyze different datasets, instead of the same tool for all datasets?
      3. Cell-type enrichment of microbial UMI: the formula has a formatting error.
      4. Analyses, page 11: "The statistical test identified that SARS-CoV-2 is enriched (p-value < 0.05) in epithelial cells, neutrophils, and plasma B cells (Fig. 3d and Table. 2)". It would be best to highlight the p < 0.05 data points in other colors rather than with red squares. Why is there no p < 0.05 square in Fig. 3e?
      5. Fig. 2a and 2b: There are 8 colors in Figure 2a, but only 4 legend entries are shown. What do the four light-colored bars mean? The same applies to Fig. 2b.


      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad117), and has published the reviews under the same license. These are as follows.

      ** Reviewer 1 Yongxin Liu** Original submission

      The manuscript presents a useful virome tool, "Vulture: Cloud-enabled scalable mining of viral reads in public scRNA-seq data", using a large and valuable dataset. The study is important in deepening our understanding of the "virome in public data". However, some issues in the manuscript need improvement. Here are the requirements for new software to be good enough for publication: Major comments:
      1. The software, test data and results must be uploaded to GitHub for peers to use, and conda and/or Docker installation is recommended for software with complex dependencies. We will take GitHub stars, forks, and downloads as one of the audience indicators. I found the GitHub link: https://github.com/holab-hku/Vulture. However, the readme.md describes the pipeline on the AWS cloud. If I do not have an AWS account, how can I run it on my own server? The project currently has only 2 stars; you need more people to take part and take an interest in it.
      2. Software installation instructions and a user tutorial are required in the Readme.md or the GitHub Wiki. Please provide a step-by-step protocol for deploying it on a laptop or server.
      3. A video of software download, installation, operation, and result display is required, recorded on a computer or server without any related software installed, to make sure that any new user can perform the whole process by following the tutorial.
      4. The software should be posted on Twitter and other social media; you can contact @iMetaScience, @microbe_article, etc. for help with tweets or retweets. The numbers of retweets, likes and views count as one of the audience indicators.
      5. Chinese speakers form the largest single-language scientific community. Provide a Chinese tutorial and video presentation of the software; contact the meta-genome Official account for help with promotion. The numbers of readers, shares and favorites are also audience indicators.
      6. The authors should continuously maintain and optimize the method, based on feedback from users all over the world, to ensure its availability, ease of use and advancement.
      7. The software name should be unique, which makes it convenient to count real users across all available resources (as with QIIME, ImageGP, and EasyAmplicon). The name Vulture is unacceptable because it returns millions of hits in Google Scholar.
      8. The figures in your paper are diverse. However, I cannot find sufficient visualization functions in your pipeline. Integrating software into a pipeline is easy; providing specific and diverse visualization is difficult. All authors want their analysis results to be ready to publish.
      9. Why focus only on viruses? Can this pipeline detect the whole microbiome, which would be of greater interest and give an overview of the microbes?

  7. Jan 2024
    1. Competing Interest StatementThe authors have declared no competing interest.

      Reviewer 2--Julia Voelker

      The manuscript about NLR-type resistance genes in two haplotypes of Melaleuca quinquenervia is a relevant contribution to the research of Myrtaceae genomes and other long-lived trees. The methods are well described and should be reproducible with the available information and raw data, provided the authors mentioned all non-default settings in the method section. The FindPlantNLRs pipeline seems to be well documented on github.

      I believe that this manuscript is ready for publication after some small changes. Page and line numbers in the comments below refer to the PDF document: 1. The quality of some figures is not good (even upon download and zoom into the plot) and should be improved to higher resolution for publication. Especially in figure 3, all labels are too pixelated and hard to read. I would also recommend an increase in text size for this figure. In Figure 6 D & E, the authors should consider using consistent text sizes on the axes, and even though the quality is acceptable, a higher resolution of the labels would still be better.

      2. p. 10, Table 2: Although it is a standard statistic for genome assemblies, it would be helpful for some readers to specify what N50 and L50 are.

      3. p. 19, line 436: I believe the authors are referring to the wrong figure number.
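
      For readers unfamiliar with the N50 and L50 statistics raised in the Table 2 comment, the following is a minimal sketch of the usual definitions: with contigs sorted from longest to shortest, N50 is the length of the contig at which the running total first reaches half of the total assembly length, and L50 is the number of contigs needed to get there. The function name and the example lengths are hypothetical and are not taken from the manuscript.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        # The first contig that pushes the running total to half the
        # assembly size defines N50; its rank defines L50.
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths (in kb), for illustration only.
example = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(example))  # prints (70, 2)
```

      A higher N50 together with a lower L50 indicates a more contiguous assembly.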

      Below are some additional comments regarding typos or other language issues. While the text is generally well written, I would appreciate commas in certain sentences to improve readability, and think that some nouns are missing articles. I hope the authors will read through their text again and add articles where required, I won't point them out individually.

      p. 4, line 33: wide range of
      p. 7, line 130: 'a' instead of 8?
      p. 8, line 177: genome
      p. 12, line 250: chromosome 2, add comma before 'while' in next line
      p. 12, line 253: on all other chromosomes?
      p. 13, line 271: to occur?
      p. 16, line 347: remove 'and'
      p. 17, line 382, 384: orthologs?
      p. 20, line 469: 'lead to the triggering of defence response' rephrase to make sense with the previous half of the sentence, also, defence response should have an article
      p. 20, line 489/490: missing word?

    2. Background The coastal wetland tree species Melaleuca quinquenervia (Cav.) S.T.Blake (Myrtaceae), commonly named the broad-leaved paperbark, is a foundation species in eastern Australia, Indonesia, Papua New Guinea, and New Caledonia. The species has been widely grown as an ornamental, becoming invasive in areas such as Florida in the United States. Long-lived trees must respond to a wide range pests and pathogens throughout their lifespan, and immune receptors encoded by the nucleotide- binding domain and leucine-rich repeat containing (NLR) gene family play a key role in plant stress responses. Expansion of this gene family is driven largely by tandem duplication, resulting in a clustering arrangement on chromosomes. Due to this clustering and their highly repetitive domain structure, comprehensive annotation of NLR encoding genes within genomes has been difficult. Additionally, as many genomes are still presented in their haploid, collapsed state, the full allelic diversity of the NLR gene family has not been widely published for outcrossing tree species.Results We assembled a chromosome-level pseudo-phased genome for M. quinquenervia and describe the full allelic diversity of plant NLRs using the novel FindPlantNLRs pipeline. Analysis reveals variation in the number of NLR genes on each haplotype, differences in clusters and in the types and numbers of novel integrated domains.Conclusions We anticipate that the high quality of the genome for M. quinquenervia will provide a new framework for functional and evolutionary studies into this important tree species. Our results indicate a likely role for maintenance of NLR allelic diversity to enable response to environmental stress, and we suggest that this allelic diversity may be even more important for long-lived plants.

      Reviewer 1– Andrew Read – University of Minnesota

      In the manuscript, A high-quality pseudo-phased genome for Melaleuca quinquenervia shows allelic diversity of NLR-type resistance genes, the authors assemble and analyze a phased genome of a long-lived tree species. In addition to providing a phased genomic resource for an important species, the authors analyze and compare the NLR gene complement in each of the two diploid genomes. I was surprised by the level of diversity of NLR genes in the two copies of the genome (this may be due to my biases based on working in highly homozygous species). This level of within-individual diversity has been largely overlooked by researchers owing to the difficulties of sequencing, assembly, and NLR identification. To address NLR identification, the authors publish a very nice pipeline that combines available tools into a framework that makes a lot of sense to me and will be valuable to anyone doing NLR gene work on new or existing genome assemblies. My main concern comes from not knowing how sequencing gaps and NLRs correlate across the two diploid genomes. Other than this, I think it’s a very nice paper that adds to the growing catalog of NLR gene diversity by tackling the challenge of NLRs in a heterozygous genome.

      Many of the authors’ interesting observations are based on comparisons of NLRs on the two haploid genomes, however some things are not clear to me:
      1.  Do any predicted NLR-genes overlap gaps in the alternative haploid genome? 
      2.  If there is a predicted NLR-gene in one haploid genome and not the alternative genome, what is at the locus? Is it a structural variant indicating insertion/deletion of the NLR or is there ‘NLR-like’ sequence there that just didn’t pass the pipeline filters indicating an NLR fossil (or similar) – to me this is an important distinction.
      3.  How many of the NLR-genes on the two haploid genomes cluster 1:1 with their homolog on the alternative haploid genome – I’m particularly interested in the 15 ‘mismatched’ N-term-NBARC examples. It would be nice to know if these have partners in the alternative haploid genome, and if the partner has the same mismatch (if not, it would support the proposed domain swapping story)
      I believe each of these concerns will require whole genome alignment of the two haploid genomes.
      

      Additional comments (by line where indicated) The authors introduce the idea that M. quinquenervia is invasive in Florida, but this thread is never followed up on in the discussion and makes it feel a bit awkward. It would help if the authors clarified how the genome could help with management in native and invasive ranges

      Could the authors add some context for why ONT data was included and how it was used?

      It would be helpful if the authors provided a weblink to the iTOL tree

      164-166 – The observation of inversions potentially caused by assembly errors is nice!

      206 – add reference: Bayer PE, Edwards D, Batley J (2018) Bias in resistance gene prediction due to repeat masking. Nat Plants 4: 762–765. pmid:30287950

      240-246 – I’m not sure about excluding these incomplete NLRs – it would be interesting and potentially informative to see where they cluster (do they cluster with an NLR from the alternative haplotype? If so, it may indicate truncation of one copy, etc.) – however, if the authors wish to remove these at this step, I think they can add a statement like “we were interested in full-length NLRs, the filtered incomplete NLRs may represent….”

      429-430 – The criteria used to define clusters are described in the methods; can you confirm (and mention) that these are the same as those used in the analyses you’re comparing to for E. grandis, rice, and Arabidopsis?

      435-437 – I’m interested to know if the four heterogenous clusters contain any of the N-term domain-swapped NLRs

      479-480 – The zf-BED domain is also present in rice NLRs – include citation for Xa1/Xo1

      523-524 – can you specify which base-call model was used on the ONT data?

      I’m curious about the presence/absence of IDs in the analyzed NLRs and would be very curious to know if the authors observe syntenic homologs across the two haploid genomes with ID presence/absence polymorphisms or with different IDs.

    1. Raw sequencing data is also in the SRA under bioproject PRJNA955401,

      Nanopublication: RAOk_Yih3v "Organism of Elaphe carinata (species) - observed nucleotide sequence - SRX20564100" https://w3id.org/np/RAOk_Yih3v2q9s4LMZsy1v-qEhZ5ZGceChnl5h-godB2M

    1. Late maturity alpha-amylase (LMA) is a wheat genetic defect causing the synthesis of high isoelectric point (pI) alpha-amylase in the aleurone as a result of a temperature shock during mid-grain development or prolonged cold throughout grain development, leading to unacceptably low falling numbers (FN) at harvest or during storage. High pI alpha-amylase is normally not synthesized until after maturity in seeds when they may sprout in response to rain or germinate following sowing the next season’s crop. Whilst the physiology is well understood, the biochemical mechanisms involved in grain LMA response remain unclear. We have employed high-throughput proteomics to analyse thousands of wheat flours displaying a range of LMA values. We have applied an array of statistical analyses to select LMA-responsive biomarkers and we have mined them using a suite of tools applicable to wheat proteins. To our knowledge, this is not only the first proteomics study tackling the wheat LMA issue, but also the largest plant-based proteomics study published to date. Logistics, technicalities, requirements, and bottlenecks of such an ambitious large-scale high-throughput proteomics experiment along with the challenges associated with big data analyses are discussed. We observed that stored LMA-affected grains activated their primary metabolisms such as glycolysis and gluconeogenesis, TCA cycle, along with DNA- and RNA-binding mechanisms, as well as protein translation. This logically transitioned to protein folding activities driven by chaperones and protein disulfide isomerase, as well as protein assembly via dimerisation and complexing. The secondary metabolism was also mobilised with the up-regulation of phytohormones, chemical and defense responses. LMA further invoked cellular structures including ribosomes, microtubules, and chromatin. Finally, and unsurprisingly, LMA expression greatly impacted grain starch and other carbohydrates with the up-regulation of alpha-gliadins and starch metabolism, whereas LMW glutenin, stachyose, sucrose, UDP-galactose and UDP-glucose were down-regulated. This work demonstrates that proteomics deserves to be part of the wheat LMA molecular toolkit and should be adopted by LMA scientists and breeders in the future. Competing Interest Statement: The authors have declared no competing interest.

      Reviewer 2. Luca Ermini

      This manuscript, which I had the pleasure of reading, is, simply put, a benchmark of five long read de novo assembly tools. Using 13 real and 72 simulated datasets, the manuscript evaluated the performance of five widely used long-read de novo assemblers: Canu, Flye, Miniasm, Raven, and Redbean.

      Although I find the methodological approach of the manuscript to be solid and trustworthy, I do not think the research is particularly innovative. Long-read assemblers have already been benchmarked in the scientific literature, and similar findings have been made. The authors are aware of this limitation of the study and have added a novel feature: the impact of read length on assembly quality, which in my opinion is still lacking sufficient innovation. However, the manuscript as a whole is valid and worthy of consideration. In light of this, I would like to share some suggestions I made in an effort to make the manuscript unique and more novel.

      Please see my comment below.

      1) Evaluation of the assemblies The metrics used to evaluate an assembly are frequently a murky subject as we are still lacking a standard language. The authors assessed the assemblies using three types of metrics: compass analysis, assembly statistics, and the Busco assessment, in addition to computational metrics like runtime and RAM usage. This is not incorrect, but I would suggest making a clear distinction between the metrics using (in addition to the computational metrics) three widely recognised metrics, or in short, the 3C criterion. The assembly metrics can be broken down into three dimensions: correctness (your compass analysis), contiguity (NG50) and completeness (the BUSCO assessment). The authors should reconsider the text using the 3C criterion; this will provide a clear, understandable, and structured way of categorising metrics. The paragraph beginning at line 197, for example, causes some confusion for the reader. The NG50 metrics evaluate assembly contiguity, whereas the number of misassemblies (considered by the authors in terms of relocation, inversion, and translocation) evaluate assembly correctness. I must admit that the misassemblies and contiguity can overlap, but I would still recommend keeping the NG50 (within contiguity) and misassemblies (within correctness) metrics separate.
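
      To make the contiguity dimension concrete, here is a minimal sketch of how NG50 is commonly computed. It differs from N50 in that the halfway threshold is taken from an estimated genome size rather than from the assembly's own total length, which keeps values comparable across assemblers. The function name, the example lengths, and the choice to return `None` when the assembly covers less than half of the genome are assumptions made for illustration; they do not describe the scripts used in the manuscript.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50, or None if the contigs cover < 50% of genome_size."""
    threshold = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # NG50 is the length of the contig that first brings the running
        # total up to half of the (estimated) genome size.
        if running >= threshold:
            return length
    return None  # assembly too small for NG50 to be defined


# Hypothetical contig lengths and genome sizes, same arbitrary units.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(ng50(contigs, genome_size=400))   # prints 50
print(ng50(contigs, genome_size=1000))  # prints None
```

      Under the 3C grouping suggested above, a value like this would be reported under contiguity, separately from misassembly counts (correctness) and BUSCO scores (completeness).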

      2) Novelty of the comparison The authors of the study had two main goals: to conduct a systematic comparison of five long-read assembly tools (Raven, Flye, Wtdbg2 or Redbean, Canu, and Miniasm) and to determine whether increased read length has a positive effect on overall assembly quality. The authors acknowledge the study's limitations and include an evaluation of the effect of read length on assembly quality as a novel feature of the manuscript (see line 70).

      The manuscript that described the Raven assembler (Vaser, R., Sikic, M. Time- and memory-efficient genome assembly with Raven. Nat Comput Sci 1, 332-336 (2021)) compared the same assembly tools (Raven, Flye, Wtdbg2 or Redbean, Canu and Miniasm) evaluated in this manuscript plus two more (Ra and Shasta), used similar eukaryotes (A. thaliana, D. melanogaster, and Human), and reached a similar conclusion on Flye in terms of contiguity (NG50) and completeness (genome fraction), but overall no assembler was best in all of the evaluated categories. In this manuscript, the authors increased the number of eukaryotic genomes (including S. cerevisiae, C. elegans, T. rubripes, and P. falciparum) and reached similar conclusions: there is no assembler that performs the best in all the evaluation categories, but overall Flye is the best-performing assembler. This strengthens the manuscript, but the research is not entirely novel.

      Given that the field of third-generation technologies is rapidly progressing toward the generation of high-quality reads (Pacbio HiFi technology and ONT Q20+ chemistry are achieving accuracy of 99% and higher), the manuscript should also include a HiFi assembler benchmark. This would add novelty to the manuscript and pique the scientific community's interest. The authors have already simulated HiFi reads from S. cerevisiae, P. falciparum, C. elegans, A. thaliana, D. melanogaster, T. rubripes in addition to reference reads (or real reads) from S. cerevisiae (SRR18210286). P. falciparum (SRR13050273) and A. thaliana (SRR14728885).

      Furthermore, I am not sure what the benefit is of evaluating Canu on HiFi data instead of HiCanu, which was designed to deal with HiFi data. The authors already included some HiFi-enabled assemblers like Flye and Wtdbg2, but Hifiasm should also be considered. I would strongly advise benchmarking the HiFi assemblers to complete the study and add a level of novelty. I would like to emphasise that the manuscript is solid and that I appreciate it; however, I believe that some novelty should be added.

      3) C. elegans genomics The now-discontinued RSII, which had a higher error rate and a shorter average read than Sequel I or Sequel II, was used to generate the genomic data from C. elegans. I understand the authors' motivation for including it in the analysis, but the use of RSII may skew the comparisons, and I would suggest adding a few sentences to the discussion about it.

      4) CPU time (h) and memory usage The authors claim the benchmark evaluation included runtime and RAM usage. However, I could not find information about the runtime and RAM usage. Please provide CPU time (h) and memory usage (GB).
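
      One way such figures could be collected is sketched below: wrap each assembler run in a small function that records wall-clock time, CPU time, and peak resident memory of the child process. This is only an illustration using the Python standard library on Unix-like systems; the `measure` helper is hypothetical, the command shown is a placeholder rather than a real assembler invocation, and nothing here reflects how the authors actually gathered their measurements.

```python
import resource  # Unix-only standard-library module
import subprocess
import sys
import time


def measure(command):
    """Run command and return (wall_seconds, cpu_seconds, peak_rss)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(command, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # CPU time = user + system time consumed by the child process.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    # ru_maxrss is the largest peak among all children waited for so far;
    # it is reported in kilobytes on Linux and in bytes on macOS.
    return wall, cpu, after.ru_maxrss


if __name__ == "__main__":
    # Placeholder workload standing in for an assembler command line.
    wall, cpu, peak = measure([sys.executable, "-c", "sum(range(10**6))"])
    print(f"wall={wall:.2f}s cpu={cpu:.2f}s peak_rss={peak}")
```

      Because `ru_maxrss` is cumulative across children, each assembler would need to be measured in a fresh parent process for the peak memory values to be comparable.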


      Minor comments:

      1) Lines 64-65 "Here, we provide a comprehensive comparison on de novo assembly tools on all TGS technologies and 7 different eukaryotic genomes, to complement the study of Wick and Holt" I would modify "on all TGS technologies" as "at the present the two main TGS technologies"

      2) Line 163 Real reads. The term "real reads" may cause confusion for readers, leading them to believe that the authors produced the sequencing reads for the manuscript. I would use the term "ref-reads" indicating "reads from the reference genomes"

      3) Lines 218-219 Please provide full names (genus + species): S. cerevisiae, P. falciparum, A. thaliana, D. melanogaster, C. elegans, and T. rubripes

      4) Supplementary Table S4: Accession number SRR15720446 seems to belong to a sample sequenced with 1 PACBIO_SMRT (Sequel II) rather than ONT.

      5) Figures 2 and 3. Figures 2 and 3 give visual results of the performance of the five assemblers. I want to make a few points here: According to what I understand, the top-performing assembler is marked with a star and is plotted with a brighter colour than the others. However, this is not immediately apparent, and some readers might have trouble identifying the colour that has been highlighted. I would suggest either lessening the intensity of the other, lower-performance assemblers or giving the best assembler a graphically distinct outline. I also wonder if it would be useful to give the exact numbers as supplemental tables.

      Re-Review:

      Dear Cosma and colleagues, Thank you so much for addressing my comments in a satisfactory manner. The manuscript, in my opinion, has dramatically improved.


      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad100), and has published the reviews under the same license. These are as follows.

      **Reviewer 1. Brandon Pickett **

      Overall, this manuscript is well-written and understandable. There's a lot of good work here and I think the authors were thoughtful about how to compare the resulting assemblies. Scripts and models used have been made available for free via GitHub and could be mirrored on or moved to GigaDB if required. I'll include several minor comments, including some line-item edits, but the bulk of my comments will focus on a few major items.

      Major Comments: My primary concern here is that the comparison is outdated and doesn't address some of the most helpful questions. CLR-only assemblies are no longer state-of-the-art. There are still applications and situations where ONT (simplex, older-pore)-only assemblies are reasonable, but most projects that are serious about generating excellent assemblies as references are unlikely to take that approach.

      Generating assemblies for non-reference situations, especially when the sequencing is done "in the field" (e.g., using a MinION with a laptop) or by a group with insufficient funding or other access to PromethIONs and Sequel/Revios, is an exception to this for ONT-only assemblies. Further, this work assumes a person wants to generate "squashed" assemblies instead of haplotype-resolved or pseudohaplotype assemblies. To be fair, sequencing technology in the TGS space has been advancing so rapidly that it is extremely difficult to keep up, and a sequencing run is often outdated by the time analyses are finished, not to mention by the time a manuscript is written, reviewed, and published.

      Accordingly, in raising my concerns, I am not objecting to the analysis being published or suggesting that the work performed was poor, but I do believe clarifications and discussion are necessary to contextualize the comparison and specify what is missing.

      1. This comparison seeks to address Third-generation sequencing technologies: namely PacBio vs. ONT. However, each company offers multiple kinds of long-read sequencing, and they are not all comparable in the same way. Just as long noisy reads (PacBio CLR & ONT simplex) are a whole new generation from "NGS" short reads like from Illumina, long-accurate reads are arguably a new generation beyond noisy long reads. If this paper wants to include PacBio HiFi reads in the comparison, significant changes are necessary to make the comparison meaningful. I think it's reasonable to drop HiFi reads from this paper altogether and focus on noisy long reads since the existing comparison isn't currently set up to tell us enough about HiFi reads and including them would be an ordeal. If including HiFi, consider the following:

      1a. Use assemblers designed for long-accurate reads. HiCanu (i.e., Canu with --pacbio-hifi option) is already used, as is a similar approach for Flye and wtdbg2. However, raven is not meant for HiFi data and miniasm is not either (though, it could be done with the correct minimap2 settings, but Hifiasm would be better). Assemblies of HiFi data with Raven and miniasm should be removed. Sidenote – Raven can be run with --weaken (or similar) for HiFi data, but it is only experimental and the parameter has since been removed. Including Hifiasm would be necessary, and it should have been included since Hifiasm was out when this analysis was done. Similarly, including MBG (released before your analysis was done) would be appropriate. Since you'd be redoing the analyses, it would be appropriate to include other assemblers that have since been released: namely LJA. One could argue that Verkko should be included, but that opens another can of worms as a hybrid assembler (more on that later).

      1b. Use a read simulator that is built for HiFi reads. Badreads is not built for HiFi data (though using custom parameters to make it work for HiFi reads wasn't a bad idea at the time), and new simulators (e.g., PBSIM3, DOI: 10.1093/nargab/lqac092) have since been released that consider the multi-pass process used to generate HiFi data.

      1c. ONT Duplex data is likely not available for the species you've chosen as it is a very new technology. However, you should at least discuss its existence as something for readers to "keep an eye on" as something that is conceptually comparable to HiFi.

      1d. Use the latest & greatest HiFi data if possible and at least discuss the evolution of HiFi data. Even better would be to compare HiFi data over time, but this data may not really be available and most people won't be using older HiFi data. Though, simulation of older data would conceivably be possible. While doing so would make this paper more complete, I would argue that it isn't worth the effort at this juncture. For reference, in my observation, older data has a median read length around 10-15 kb instead of 18-22 kb.

      1e. Include real HiFi data for the species you are assembling. If none is available and you aren't in a position to generate it, then keep the HiFi assembler comparison on real data separate from that of the CLR/ONT assembler comparisons on real data by using real HiFi data for other species.

      2. Discuss in the intro and/or discussion that you are focusing on "squashed" assemblies. Without clever sample separation and/or trio-based approaches (e.g., DOI: 10.1038/nbt.4277), a single squashed haplotype is the only possible outcome for PacBio CLR and ONT-only approaches. For non-haploid genomes, other approaches (HiFi-only or hybrid approaches (e.g., HiFi + ONT or HiFi + Hi-C)) can generate pseudohaplotypes at worst and fully-resolved haplotypes at best. The latter is an objectively better option when possible, and it's important to note that this comparison wouldn't apply when planning a project with such goals. Similarly, it would probably be helpful to point out to the novice reader that this comparison doesn't apply to metagenome assembly either.

      3. The title suggests to the reader that we'll be shown how long reads make a difference in assembly compared to non-long read approaches.
However, this is not the case, despite some mention of it near line 318. Short read assemblies are not compared here and no discussion is provided to suggest how long read-based assemblies would improve outcomes in various situations relative to short reads. Unless such a comparison and/or discussion is added, I think the title should be changed. I've included this point in the "Major Comments" section because including such a comparison would be a big overhaul, but I don't expect this to be done. The core concern is that the analysis is portrayed correctly.

      4. Sequencing technologies are often portrayed as static through time, but this is not accurate. This is a failing of the field generally. Part of the problem is the length of the publishing cycle (often >1yr from when a paper is written to when it's published, not to mention how long it takes to do the analysis before a paper is even written). Part of the problem is that current statistics are often cited in influential papers and then re-cited in more recent papers based on the influential paper despite changes having been made since that influential paper was released. Accordingly, the error rate in ONT reads has been misreported as being ~15% for many years even though their chemistry has improved over time and the machine learning models (especially for human samples) have also improved, dropping the error rate substantially. ONT has made improvements to their chemistry and changed nanopores over time and PacBio has tinkered with their polymerase and chemistry too. Accordingly, a better question for a person planning to perform an assembly would be "which assembler is best for my datatype (PacBio CLR vs ONT) and chemistry/etc.?" instead of just differentiating by company. Any comparison of those datatypes should at least address this as a factor in their discussion, if not directly in their analysis. I feel that this is missing from this comparison.
In an ideal world, we'd have various CLR chemistries and ONT pores/etc. for each species in this analysis. That data likely doesn't exist for each of the chosen species though, and generating it would be non-trivial, especially retroactively. Using the most recent versions is a good option, but may also not exist for every species chosen. Since this analysis was started (circa Nov/Dec 2021 by my estimate based on the chosen assembler versions), ONT has released pore 10; in combination with the most recent release of Guppy, error rates <=3% are expected for a huge portion of the data. That type of data is likely to assemble very differently from R9.4, and starker differences would be expected for data older than R9.4. Even if all the data were the most recent (or from the same generation (e.g., R9.4)), library preps vary greatly, especially between UL (ultralong) libraries and non-UL libraries. Having reads >100kb, especially a large number of them, makes a big difference in assembly outcome in my observation. How does choice of assembler (and possibly different parameters) affect the assembly when UL data is included? How is that different from non-UL data? What about UL data at different percentages of the reads being considered UL? A paper focusing on long noisy reads would be much more impactful if it addressed these questions. Again, this may not be possible for this particular paper considering what's already been done and the available funding, and I think that's okay. However, these issues need to be addressed in the discussion as open questions and suggested future work. The type of CLR and ONT data also needs to be specified in this work, e.g., in a supplemental table, and if the various datasets are not from the same types, these differences need to be acknowledged.
At a minimum, I think the following data points should be included: chemistry/pore information (e.g., R9.4 for ONT or P2/C5 for PacBio), basecaller (e.g., guppy vX.Y.Z), and read length distribution info (e.g., mean, st. dev., median, %>100kb), ideally a plot of the distribution in addition to summary values. I also understand that these data were generated previously by others, and this information should theoretically be available from their original publications, which are hopefully accessible via the INSDC records associated with the provided accessions. The objective here is making the information easily accessible to the readers of this paper because those could be confounding variables in the analysis.
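
The read-length summary values requested above (mean, st. dev., median, %>100kb) can be sketched as follows. This is a minimal illustration rather than anything from the manuscript: the function name `summarize_read_lengths`, the `ul_threshold` parameter, and the example lengths are all hypothetical, and the read lengths are assumed to have already been extracted from the FASTQ/BAM files.

```python
import statistics


def summarize_read_lengths(lengths, ul_threshold=100_000):
    """Summarize a list of read lengths (bp).

    ul_threshold is the length above which a read is counted as
    "ultralong" here (100 kb, following the %>100kb value in the comment).
    """
    if not lengths:
        raise ValueError("no read lengths supplied")
    n_over = sum(1 for length in lengths if length > ul_threshold)
    return {
        "n_reads": len(lengths),
        "mean": statistics.mean(lengths),
        # Sample standard deviation is undefined for a single read.
        "stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        "median": statistics.median(lengths),
        "pct_over_threshold": 100.0 * n_over / len(lengths),
    }


# Hypothetical toy data, not taken from any dataset in the paper.
example_lengths = [5_000, 12_000, 18_000, 22_000, 150_000]
read_stats = summarize_read_lengths(example_lengths)
for key, value in read_stats.items():
    print(f"{key}: {value}")
```

A table of these values per dataset, alongside chemistry/pore and basecaller version, would cover the minimum set of data points listed above; a histogram of the same lengths would supply the distribution plot.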

      5. This comparison considered only a single coverage level (30x). That's not an unreasonable shortcut, but it certainly leaves a lot of room for differences between assemblers. If the objective of the paper is to help future project planners decide what assembler to use, it would be most helpful if they had an idea of what coverage they can use and still succeed. That's especially true for projects that don't have a lot of funding or aren't planning to make a near-perfect reference genome (which would likely spend the money on high coverage of multiple datatypes). It would be helpful to include some discussion about how these results may be different at much lower (e.g., 2x or 10x coverage) or at higher coverage (e.g., 50x, 70x, etc.) and/or provide some justification from another study for why including that kind of comparison would be unlikely to be worthwhile for this study, even if project planners should consider those factors when developing their budget and objectives.
      6. Figures 2 and 3 include a lot of information, and I generally like how they look and that they provide a quick overview. I believe two things are missing that will improve either the assessment or the presentation of the information, and I think one change will also improve things.

      6a. I think metrics from Merqury (DOI: 10.1186/s13059-020-02134-9) should be included where possible. Specifically, the k-mer completeness (recovery rate) and reference-free QV estimate (#s 1 and 3 from https://github.com/marbl/merqury/wiki/2.-Overall-k-mer-evaluation). Generally these are meant to be done from data of the same individual. However, most of the species selected for this comparison are highly homozygous strains that should have Illumina data available, and thus having the data come from not the exact same individual will likely be okay. This can serve as another source of validation. If such a dataset is not available for 1 or more of these species, then specify in the text that it wasn't available, and thus such an evaluation wasn't possible. If it's not possible to add one or both of these metrics to the figures (2 & 3), that's fine, but having it as a separate figure would still be helpful. I find these values to be some of the most informative for the quality of an assembly.

      6b. It's not strictly necessary, so this might be more of a minor comment, but I found that I wanted to view individual plots for each metric. Perhaps including such plots in the supplement would help (e.g., 6 sets of plots similar to figure 4 with color based on assembler, grouping based on species, and opacity based on datatype). The specifics aren't critical, I just found it hard to get more than a very general idea from the main figures and wanted something easy to digest for each metric.

      6c. Using N50/NG50 for a measure of contiguity is an outdated and often misleading approach. Unfortunately, it's become such common practice that many people feel obligated to include it or use it. Instead, the auN (auNG) would be a better choice for contiguity: https://lh3.github.io/2020/04/08/a-new-metric-on-assembly-contiguity.
      7. This paper focuses on assembly and intentionally does not consider polishing (line 176), which I think is a reasonable choice. It also does not consider scaffolding or hybrid assembly approaches (again, reasonable choices). In the case of hybrid assembly options, most weren't available when this analysis was done (short read + long read assemblers were available, but I think it's perfectly reasonable to not have included those). Given the frequency of scaffolding (especially with Hi-C data [DOIs: 10.1371/journal.pcbi.1007273 & 10.1093/bioinformatics/btac808]) and the recent shift to hybrid assemblers (e.g., phasing HiFi-based string graphs using Hi-C data to get haplotype-resolved diploid assemblies (albeit with some switch errors) [DOI: 10.1038/s41587-022-01261-x] or resolving HiFi-based minimizer de Bruijn graphs using ONT data and parental Illumina data to get complete, T2T diploid assemblies [DOI: 10.1038/s41587-023-01662-6]), I think it would be appropriate to briefly mention these methods so the novice reader will know that this benchmark does not apply to hybrid approaches or post-assembly genome finishing. This is a minor change, but I included it in this section because it matches the general theme of ensuring the scope of this benchmark is clear.
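
To illustrate the contiguity point in comment 6c, the sketch below computes N50 and auN for two hypothetical sets of contig lengths. It assumes the definition of auN given in the blog post linked in that comment (the sum of squared contig lengths divided by the total assembly length); the function names and the toy lengths are illustrative only, and the NG variants are not shown.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of the assembly length."""
    total = sum(contig_lengths)
    if total == 0:
        raise ValueError("empty assembly")
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def aun(contig_lengths):
    """Area under the Nx curve: sum(L_i^2) / sum(L_i)."""
    total = sum(contig_lengths)
    if total == 0:
        raise ValueError("empty assembly")
    return sum(length * length for length in contig_lengths) / total


# Two toy assemblies of equal total length (200 units). In the second,
# one of the shorter contigs is broken in two.
assembly_a = [100, 50, 50]
assembly_b = [100, 50, 25, 25]

print(n50(assembly_a), aun(assembly_a))
print(n50(assembly_b), aun(assembly_b))
```

In this toy case N50 is identical for both assemblies because it depends only on the single contig that crosses the 50% mark, whereas auN drops when the shorter contig is fragmented, which is the kind of sensitivity that motivates the suggestion.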

      Minor Comments: 1. line 25 in the abstract. Change Redbean to wtdbg2 for consistency with the rest of the manuscript.

      1. "de novo" should be italicized. It is done correctly in some places but not in others.

      2. line 64. "all TGS technologies": I would argue that this isn't quite true. ONT Duplex isn't included here even though Duplex likely didn't exist when you did this work. Also, see the major comments concerning whether TGS should include HiFi and Duplex.

      3. Table 1. Read length distributions vary dramatically by technology and library prep. E.g., HiFi is often a very tight distribution about the mean because of size selection. Including the median in the table would be helpful, but more importantly, I would like to see read-length distribution plots in the supplement for (a) the real data used to generate the initial iteration models and (b) the real data from each species.

      4. line 166 "fair comparison". I'm not sure that a fair comparison should be the goal, but having them at the same coverage level makes them more comparable which is helpful. Maybe rephrase to indicate that keeping them at the same coverage level reduces potentially confounding variables when comparing between the real and simulated datasets.

      5. line 169. Citation 18 is used for Canu, which is appropriate but incomplete. The citation for HiCanu should also be included here: DOI: 10.1101/gr.263566.120.

      6. line 169. State that these were the most current releases of the various assemblers at the time that this analysis was started. Presumably, that was Nov/Dec 2021. Since then, Raven has gone from v1.7.0->1.8.1 and Flye has gone from v2.9->2.9.1.

      7. line 175. Table S6 is mentioned here, but S5 has not yet been mentioned. S5 is mentioned for the first time on line 196. These two supp tables' numbers should be swapped.

      8. There is inconsistent use of the Oxford comma. I noticed it is missing multiple times, e.g., lines 191, 208, 259, & 342.

      9. line 193. The comma at the end of the line (after "tools") should be removed. Alternatively, keep the comma but add a subject to the next clause to make it an independent clause (e.g., "...assembly tools, and they were computed...").

      10. line 237. The N50 of the reference is being used here. You provide accessions for the references used, but most people will not go look those up (which is reasonable). The sequences in a reference can vary greatly in their lengths, even within the same species, because which sequences are included in the reference is not standardized. Even the size difference between a homogametic and heterogametic reference can be non-trivial. Which sequences are included in the reference, and more importantly in your N50 value, can significantly change the outcome and may bias results if this is not done consistently between the included species. It would be helpful if the contents of these references were summarized here or elsewhere (e.g., in some supplemental text or a table). In addition to 1 copy of each of the expected autosomes, were any of the following included: (a) one or two sex chromosomes if applicable, (b) mitochondrial, chloroplast, or other organelle sequences, (c) alternate sequences (i.e., another copy of an allele of some sequence included elsewhere), (d) unplaced sequence from the 1st copy, (e) unplaced sequence from subsequent copies, and (f) vectors (e.g., EBV used when transforming a cell line)?

      11. Supplemental tables. Some cells are uncolored, and other cells are colored red or blue with varying shading. I didn't notice a legend or description of what the coloring and shading was supposed to mean. Please include this either with each table or at the beginning of the supplemental section that includes these tables and state that it applies to all tables #-#.

      12. Supplemental table S3. It was not clear to me that you created your own model for the hifi data (pacbio_hifi_human2022). I was really confused when I couldn't find that model in the GitHub repo for Badreads. In the caption for this table or in the text somewhere, please make it more explicit that you created this yourself instead of using an existing model.

    1. AbstractBackground Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data processing pipeline from raw data analysis to end-user predictions and re-scoring. ML models need large-scale datasets for training and re-purposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs.Results We aim to facilitate ML efforts in MS data by conducting a systematic analysis of the potential sources of variance in public MS repositories. We also examine how these factors affect ML performance and perform a comprehensive transfer learning to evaluate the benefits of current best practice methods in the field for transfer learning.Conclusions Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it’s important to construct datasets most closely resembling future test cases, as transferability is severely limited for unseen datasets. We also found that transfer learning, although it did increase model performance, did not increase model performance compared to a non-pre-trained model.Competing Interest StatementThe authors have declared no competing interest.

      **Reviewer 2: Luke Carroll**

      The paper applies machine learning to publicly available proteomics data sets and assesses the ability to transfer learning algorithms between projects. The primary aim of these algorithms appears to be an attempt to increase consistency of retention time prediction for data-dependent acquisition data sets; however, this is not explicitly stated within the text. The application of machine learning to derive insight from previously performed proteomics experiments is a worthwhile exercise.

      1. The authors report ΔRT to determine fitting for their models. It would be interesting to see whether other metrics were used to assess the fitting of the models, or whether the models could be used to increase the number of identifications within sample sets, and whether this was successful. Alternatively, were there any conclusions able to be drawn about peptide structure and RT determination from these models?

      2. Project specific libraries are well known to improve results compared with publicly available databases, and the discussion on this point should be developed further through comparison of this work with other papers - particularly with advances in machine learning and neural networks in the data independent analysis field.

      3. The comparison of Q Exactive models vs. Orbitraps appears to be somewhat redundant, and possibly a result of poor metadata, as Q Exactive instruments are Orbitrap mass spectrometers. A more interesting comparison to make here would be between Orbitrap and TOF instruments, though as the datasets have all been processed through MaxQuant, it is likely the vast majority were acquired on Orbitrap instruments.

      4. The paper uses ΔRT as the readout for all models tested; however, the only chromatography variable considered in testing the models is gradient length. Chromatography is also dependent on column chemistry, column dimensions, composition of buffer, use of traps, temperature, etc. These are also likely to be contributing to the variance observed between the PT datasets, where these variables will be consistent, and the publicly available datasets. These factors are also likely to play a role in the higher uncertainty for early and late eluting peptides, where they are likely to vary most between sample sets. The metadata may not be available to compare within the data sets selected, so the authors should at minimum discuss these points.

      5. Sample preparation is likely to have similar effects. The PT datasets are generated synthetically using ideal peptides, whereas publicly available datasets are generated from complex sample mixtures and have increased variance due to inefficiencies of digestion, sample clean-up, and matrix effects. Previous studies on variance have also described sample preparation as the highest cause of variance. This needs further discussion.

      6. While the isolation windows of the m/z will lead to unobserved space, search engine settings will also apply here. From the text, it appears that the only spectra considered were those already identified in a search program (due to Andromeda cut-off scores always applying). Typical settings for a database search will have a cut-off of peptide sequences of at least 7 residues, making peptide masses appearing lower than 350 m/z unlikely. There is also a significant amount of noise below 350 m/z, and this also likely contributes to poorer fitting.

      7. The authors identify differences in MSMS spectral features, however, most of these points are well known in the field. The authors should develop the discussion on the causes of the differences in fragmentation, as CID low mass drop off is expected, and the change in profile is expected with increasing activation energies. A more developed analysis could exclude precursor masses from these plots and focus solely on fragment ions generated.

      8. The authors highlight that internal fragmentation of peptides could be used as a valuable resource to implement in machine learning. There has already been some success using these fragmentation patterns for sequence identification within both top-down and bottom-up proteomic searches that the authors should consider discussing. However, these data do not appear to be incorporated into the machine learning models in this paper, or at least do not seem to play a significant role in prediction, and this section appears to be a bit out of place.

      Re-Review: The changes and additions to the discussion for the paper address the key points, and have addressed some of the limitations imposed by the availability and ability to extract certain data elements, particularly around sample preparation and LC settings. I think this strengthens their manuscript, and provides a more holistic discussion of factors in the experimental setup.

    2. Background Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data processing pipeline from raw data analysis to end-user predictions and re-scoring. ML models need large-scale datasets for training and re-purposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs.Results We aim to facilitate ML efforts in MS data by conducting a systematic analysis of the potential sources of variance in public MS repositories. We also examine how these factors affect ML performance and perform a comprehensive transfer learning to evaluate the benefits of current best practice methods in the field for transfer learning.Conclusions Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it’s important to construct datasets most closely resembling future test cases, as transferability is severely limited for unseen datasets. We also found that transfer learning, although it did increase model performance, did not increase model performance compared to a non-pre-trained model.

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad096), and has published the reviews under the same license. These are as follows.

      **Reviewer 1: Juntao Li**

      This paper aimed to facilitate machine learning efforts in mass spectrometry data by conducting a systematic analysis of the potential sources of variance in public mass spectrometry repositories. This paper examined how these factors affect machine learning performance and performed a comprehensive transfer learning to evaluate the benefits of current best practice methods in the field for transfer learning. Although the experimental content is extensive and provides promising results, some major points need to be addressed as follows:

      1. Please explain the rationale for using RT to evaluate model performance. In addition, other evaluation metrics should be added to provide a more robust comparison of model performance.

      2. The curves in Figures 6 and 8 need more explanation to help readers understand them. In addition, all figures are somewhat blurry, and clearer figures should be provided.

      3. This paper does not provide the specific steps of the variance analysis. Please describe the variance analysis process in mathematical language and provide the corresponding mathematical formulas.

      4. There are some formatting issues: Keywords and the title 'Data Description' should only have the first letter capitalized. On pages 6, 17, and 18, the font size of the article is inconsistent.

      5. There are some grammar issues: On pages 6 and 16, 'dataset' should be pluralized with an 's'. On page 7, lines 9-10, the tense is inconsistent.

      6. There are significant issues with the format of the references: the capitalization of initial letters in literature titles is inconsistent, such as in [1] and [5], and some entries lack page numbers, such as [6] and [18]. Please reorganize the references according to the format required by the journal.

      Re-Review:

      I am glad to see that the authors have revised the manuscript based on the reviewer's comments and improved its quality. However, the responses to some comments did not fully convince me. I suggest the authors further revise or explain the following issues.

      1. I agree with the rationale for ΔRT as a performance measure, but I do not agree with the authors' statement that 'However, as the model performance indicates metric variance, and there are no changes to the conclusions drawn from the model performance'. I suggest the authors report other classic machine learning performance metrics on the test dataset and analyze the differences.

      2. To avoid randomness caused by a single data partition (training and testing split), multiple random data partitions (50 or 100) are usually used, and the performance of learners is evaluated using the mean and variance of the performance measures across partitions. It is recommended that the authors consider this issue.

      3. The structure and references of the papers that I have seen officially published in GigaScience are very different from those of the manuscript (the authors have claimed to have organized and written it according to the requirements). I am not sure if it was my mistake or the authors' mistake. I suggest the authors confirm the issue again and improve the writing.
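
Points 1 and 2 of the re-review can be made concrete with a small sketch: several classic regression metrics (MAE, RMSE, R²) are computed over many random train/test partitions and reported as mean and standard deviation. Everything here is an assumption for illustration. The data are synthetic, a one-variable least-squares line stands in for the retention time predictor, and the function names are hypothetical; none of it reflects the authors' actual models or data.

```python
import math
import random
import statistics


def regression_metrics(y_true, y_pred):
    """MAE, RMSE and R^2 for paired observed and predicted values."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def repeated_holdout(xs, ys, n_repeats=100, test_fraction=0.2, seed=0):
    """Refit on n_repeats random partitions; return (mean, stdev) per metric."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_test = max(2, int(len(xs) * test_fraction))
    runs = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        slope, intercept = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        predictions = [slope * xs[i] + intercept for i in test_idx]
        runs.append(regression_metrics([ys[i] for i in test_idx], predictions))
    return {
        name: (
            statistics.mean(run[name] for run in runs),
            statistics.stdev(run[name] for run in runs),
        )
        for name in ("mae", "rmse", "r2")
    }


# Synthetic stand-in data: a noisy linear relationship.
data_rng = random.Random(1)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0.0, 1.0) for x in xs]

summary = repeated_holdout(xs, ys)
for name, (mean_value, sd_value) in summary.items():
    print(f"{name}: mean={mean_value:.3f} sd={sd_value:.3f}")
```

Reporting the spread across partitions alongside the mean shows whether a difference between two models exceeds the variation caused by the choice of split alone.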

    1. Background Assembly algorithm choice should be a deliberate, well-justified decision when researchers create genome assemblies for eukaryotic organisms from third-generation sequencing technologies. While third-generation sequencing by Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) have overcome the disadvantages of short read lengths specific to next-generation sequencing (NGS), third-generation sequencers are known to produce more error-prone reads, thereby generating a new set of challenges for assembly algorithms and pipelines. Since the introduction of third-generation sequencing technologies, many tools have been developed that aim to take advantage of the longer reads, and researchers need to choose the correct assembler for their projects.Results We benchmarked state-of-the-art long-read de novo assemblers, to help readers make a balanced choice for the assembly of eukaryotes. To this end, we used 13 real and 72 simulated datasets from different eukaryotic genomes, with different read length distributions, imitating PacBio CLR, PacBio HiFi, and ONT sequencing to evaluate the assemblers. We include five commonly used long read assemblers in our benchmark: Canu, Flye, Miniasm, Raven and Redbean. Evaluation categories address the following metrics: reference-based metrics, assembly statistics, misassembly count, BUSCO completeness, runtime, and RAM usage. Additionally, we investigated the effect of increased read length on the quality of the assemblies, and report that read length can, but does not always, positively impact assembly quality.Conclusions Our benchmark concludes that there is no assembler that performs the best in all the evaluation categories. However, our results shows that overall Flye is the best-performing assembler, both on real and simulated data. 
Next, the benchmarking using longer reads shows that the increased read length improves assembly quality, but the extent to which that can be achieved depends on the size and complexity of the reference genome.Competing Interest StatementThe authors have declared no competing interest.

      **Reviewer 2: Katharina Scherf**

      General comments: This paper is a very thorough report on large-scale proteomics mapping of ca. 4000 wheat samples and several challenges related to sample preparation, measurement and data analysis. It is the first paper reporting such an extensive dataset and tools for analysis. Overall, I think that the authors have done in-depth work and it is also described in a way that can be understood well. The descriptions of how the authors arrived at the final workflow will also be useful to other groups attempting to do proteomics of wheat or other grains. I have only a few comments for improvement. Note: line numbers would have been helpful.

      Specific comments Abstract - Results: "LMA expression greatly impacted grain starch and other carbohydrates …" and then alpha-gliadins and LMW glutenin is mentioned. However, these are proteins and their relation to starch/carbohydrates is not clear.

      Introduction overall: Please harmonize the use of alpha-amylase and a-amylase; alpha-amylase is recommended, or else the Greek letter.

      p3, L1: "great source of protein": In terms of quantity, this is true. However, you should also include a brief statement about protein quality, which is not ideal, especially when considering gluten proteins

      section 2.1: Please include if all samples were grown together at the same place in one year (or not); i.e. include the information from section 3.1.1 already here.

    2. AbstractBackground Assembly algorithm choice should be a deliberate, well-justified decision when researchers create genome assemblies for eukaryotic organisms from third-generation sequencing technologies. While third-generation sequencing by Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) have overcome the disadvantages of short read lengths specific to next-generation sequencing (NGS), third-generation sequencers are known to produce more error-prone reads, thereby generating a new set of challenges for assembly algorithms and pipelines. Since the introduction of third-generation sequencing technologies, many tools have been developed that aim to take advantage of the longer reads, and researchers need to choose the correct assembler for their projects.Results We benchmarked state-of-the-art long-read de novo assemblers, to help readers make a balanced choice for the assembly of eukaryotes. To this end, we used 13 real and 72 simulated datasets from different eukaryotic genomes, with different read length distributions, imitating PacBio CLR, PacBio HiFi, and ONT sequencing to evaluate the assemblers. We include five commonly used long read assemblers in our benchmark: Canu, Flye, Miniasm, Raven and Redbean. Evaluation categories address the following metrics: reference-based metrics, assembly statistics, misassembly count, BUSCO completeness, runtime, and RAM usage. Additionally, we investigated the effect of increased read length on the quality of the assemblies, and report that read length can, but does not always, positively impact assembly quality.Conclusions Our benchmark concludes that there is no assembler that performs the best in all the evaluation categories. However, our results shows that overall Flye is the best-performing assembler, both on real and simulated data. 
Next, the benchmarking using longer reads shows that the increased read length improves assembly quality, but the extent to which that can be achieved depends on the size and complexity of the reference genome.

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad084), and has published the reviews under the same license. These are as follows.

      **Reviewer 1: Nobuaki Takemori**

      The large proteome dataset for wheat, a representative grain, presented in this manuscript is valuable not only for agricultural science but also for basic plant science, but unfortunately, the manuscript is too wordy and overly detailed in its description. Of course, a detailed description of the experimental methods and data generation process is an important component in obtaining reproducibility, but excessive information in the main text may have the unintended effect of hindering the reader's understanding of the manuscript. The volume of the main text in this manuscript should be reduced to 1/2 or even 1/3 of the original by referring to the following suggested revisions.

      Title: It looks rather like the title of a review article and is not appropriate for the title of an original research paper. An abbreviation is also used, making it difficult to understand. It should be changed to a title that more specifically and pragmatically reflects the content of the paper.

      Materials and Methods 2.3: The sample pretreatment used in this experiment has already been described in Ref. 41, so detailed description in this text is unnecessary. Also, Figure 1, which visualizes the experimental process, is too packed with information and is difficult to read in its small font. In addition, many extraneous photographs of LC-MS instruments and other common equipment are included. Sample pretreatment should be described very briefly in the text, and only those areas where there are differences from previous reports should be mentioned. If the author wishes to describe the details of the experiment to assure reproducibility, it is recommended to describe it in the form of an experimental protocol and include it in the Supplementary Information.

      Materials and Methods 2.5: The 11 different paths the authors have set up for LC-MS/MS analysis are difficult to understand in text. Maybe they could be summarized in a table or visualized using a flowchart.

      Materials and Methods 2.6 to 2.9: It is recommended that only the essentials be described in the text and the minute details be moved to the Supplementary Information.

      Results 3.2.(p 26, line 11-20): The description should be moved to the introduction.

      Results 3.1.3-3.1.4: Too detailed and too long. Only the main points should be mentioned. It would be effective to use concise Figures where possible.

      Figure 6: Too much information; A, B, F, and G should be supplemental information.

      Figure 8: Wheat cartoon is unnecessary. The font is too small. This information should be in a Table.

  8. Dec 2023
    1. Editors Assessment: Antimicrobial resistance (AMR) is a global public health threat, and environmental microbial communities can act as reservoirs for resistance genes. Genomic surveillance is needed to provide insights into how these reservoirs change and impact public health. With that goal in mind, this study tested the ability of nanopore sequencing and adaptive sampling to enrich for AMR genes in a mock community of environmental origin. On average, adaptive sampling resulted in a target composition 4x higher than without adaptive sampling, and increased target yield in most replicates. The methods and scripts for this approach were reviewed and curated together, although the scope of this study was limited in terms of communities tested and AMR genes targeted. The authors improved their analysis by conducting an additional analysis of a diverse microbial community, demonstrating that the method is reusable and that its results are promising for developing a flexible, portable, and cost-effective AMR surveillance tool.

      *This evaluation refers to version 1 of the preprint*

    2. Abstract. Antimicrobial resistance (AMR) is a global public health threat. Environmental microbial communities act as reservoirs for AMR, containing genes associated with resistance, their precursors, and the selective pressures to encourage their persistence. Genomic surveillance could provide insight into how these reservoirs are changing and their impact on public health. The ability to enrich for AMR genomic signatures in complex microbial communities would strengthen surveillance efforts and reduce time-to-answer. Here, we test the ability of nanopore sequencing and adaptive sampling to enrich for AMR genes in a mock community of environmental origin. Our setup implemented the MinION mk1B, an NVIDIA Jetson Xavier GPU, and flongle flow cells. We observed consistent enrichment by composition when using adaptive sampling. On average, adaptive sampling resulted in a target composition that was 4x higher than a treatment without adaptive sampling. Despite a decrease in total sequencing output, the use of adaptive sampling increased target yield in most replicates.
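
      As a rough illustration of the two measures named in the abstract, the sketch below separates enrichment by composition (the share of sequenced bases that hit targets) from enrichment by yield (the absolute number of target bases). The function names and all base counts are hypothetical and are not taken from the study; they only show how composition can rise about fourfold, with target yield also increasing, even when total output drops.

```python
def composition(target_bases: int, total_bases: int) -> float:
    """Fraction of all sequenced bases that fall in target (AMR) regions."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return target_bases / total_bases


def enrichment(as_target: int, as_total: int, ctrl_target: int, ctrl_total: int):
    """Return (enrichment by composition, enrichment by yield).

    'as_*' are counts from an adaptive sampling run, 'ctrl_*' from a
    standard (control) run. Both ratios are relative to the control.
    """
    by_composition = composition(as_target, as_total) / composition(ctrl_target, ctrl_total)
    by_yield = as_target / ctrl_target
    return by_composition, by_yield


# Hypothetical counts: the adaptive sampling run produces a third of the
# total bases of the control run, yet more target bases.
by_comp, by_yield = enrichment(
    as_target=2_000_000, as_total=100_000_000,
    ctrl_target=1_500_000, ctrl_total=300_000_000,
)
print(f"by composition: {by_comp:.2f}x, by yield: {by_yield:.2f}x")
```

      Keeping the two ratios separate makes it clear that a higher target composition does not by itself guarantee more target data.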

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.103), and has published the reviews under the same license. These are as follows.

      **Reviewer 1. Ned Peel.**

      Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code?

      Yes. I do not think the authors have included a specific license and assume the code will be released under a Creative Commons CC0 waiver.

      As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?

      No. No guidelines on how to contribute, report issues or seek support on the code.

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?

      Yes. A list of software used, along with version numbers, can be found in "dart_methods_notebook.md"

      Additional Comments:

      The authors describe each step of the analysis well and have provided code to reproduce the analysis and figures from the manuscript.

      **Reviewer 2. Julian Sommer**

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?

      No. Not applicable to this study, since no novel software is described.

      Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code?

      Not applicable to this study, since no novel software is described.

      As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?

      No. Not applicable to this study, since no novel software is described.

      Is the code executable?

      Unable to test. The code and software used for analysis of the data is reported in the supplement data. However, the data used in this study in the SRA biobank is not available to download at the time of this review.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

      Unable to test. See above.

      Is the documentation provided clear and user friendly?

      Yes. The analysis steps are clearly commented.

      Is there enough clear information in the documentation to install, run and test this tool, including information on where to seek help if required?

      No. The code provided for the data analysis is not usable without the raw sequencing data.

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages?

      Not applicable.

      Additional Comments.

      The aim of this study was to test the ability of adaptive sampling sequencing on the Oxford Nanopore sequencer to enrich for antibiotic resistance genes in a synthetic mixture of bacterial DNA. DNA from six environmental bacterial isolates with known antibiotic resistance genes was mixed at equal mass and used for metagenomic sequencing on an Oxford Nanopore MinION MK1B, comparing adaptive sampling with standard sequencing. By analysing 10 sequencing runs using low throughput, low cost flongle flow cells, the authors obtained sequencing data to compare adaptive sampling and standard sequencing approaches. Using a defined composition of sequenced sample and technical and biological replicates, the method is generally suitable. From their data, the authors conclude, by analysing different parameters, that adaptive sampling significantly reduces throughput and increases gene target enrichment.

      This result is important for the use of adaptive sampling in general, but has already been reported in numerous publications that the authors cite in their study. According to the authors, the novel aspect of this work is the environmental origin of the bacteria used to generate the synthetic mock community. However, since the approach of adaptive sampling does not change regardless of the origin of the sequenced DNA, there are no significant new insights generated in this study. Also, the synthetic mock community of six members does not resemble an environmental metagenomic sample with incomparably more complex species diversity with different abundances. From the data presented in this study, no conclusions can be drawn regarding the performance of adaptive sampling sequencing of environmental metagenomic samples.

      To improve the study, I suggest the following: Sequencing of DNA from environmental samples using nanopore sequencing without adaptive sampling and identification of antibiotic resistance genes. Subsequently, resequencing the sample using adaptive sampling based on the identified antibiotic resistance genes and comparing the results in terms of gene target enrichment as analysed in the study. This was partly suggested by the authors and should be carried out to gain new insights into the very interesting application of metagenomic sequencing for the One Health approach.

      Additionally, there are some inconsistencies in the manuscript. For example, lines 128-132 describe the sequencing process using different flowcells and technical replicates. However, it remains unclear how half of the channels of each flowcell were reserved for adaptive sampling sequencing, since adaptive sampling sequencing is always performed on the whole flowcell. Additionally, it is stated that each flowcell was used twice for sequencing; however, no method for reusing the flongle flowcells is described and no protocol for this is available from Oxford Nanopore.

    1. The genome assembly and annotation of the Chinese cobra, Naja atra

      Nanopublication: RAyW5v4w76 "Article: The genome assembly and annotation of the Chinese cobra, Naja atra" https://w3id.org/np/RAyW5v4w76mcFJYDreFTuhc4Yu0sKwZQBccYfoB_Q-7_o

    2. Raw reads are available in the SRA via bioproject PRJNA955401. Additional data is in the GigaDB repository [25: Wang J, Wu Y, Wang S. Supporting data for “The genome assembly and annotation of the Chinese cobra, Naja atra”. GigaScience Database, 2023; http://dx.doi.org/10.5524/102476].

      Nanopublication: RAt6pmOk9T "Organism of ?term=txid8656 - sequenced nucleotide sequence - PRJNA955401" https://w3id.org/np/RAt6pmOk9T4pCGTI5HTJ3hntFoIWRNv5zpGSNxX0JTYVk

  9. Nov 2023
    1. Editors Assessment:

      The hairy vetch Vicia villosa is an annual legume widely used as a cover crop due to its ability to withstand harsh winters. Here a new 2.03 Gbp reference-quality genome is presented, assembled from PacBio HiFi long-sequence reads and Hi-C scaffolding. After adding some more methodological details and a long-terminal repeat (LTR) assembly index (LAI) analysis, the assembly quality and metrics look quite convincing as a chromosome-scale assembly. This resource will hopefully provide the foundation for a genetic improvement program for this important cover crop and forage species.

      This evaluation refers to version 1 of the preprint

    2. ABSTRACT. Vicia villosa is an incompletely domesticated annual legume of the Fabaceae family native to Europe and Western Asia. V. villosa is widely used as a cover crop and as a forage due to its ability to withstand harsh winters. A reference-quality genome assembly (Vvill1.0) was prepared from low error rate long sequence reads to improve genetic-based trait selection of this species. The Vvill1.0 assembly includes seven scaffolds corresponding to the seven estimated linkage groups and comprising approximately 68% of the total genome size of 2.03 gigabase pairs (Gbp). This assembly is expected to be a useful resource for genetic improvement of this emerging cover crop species as well as to provide useful insights into plant genome evolution.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.98), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Rong Liu

      See reviewer comments document: https://gigabyte-review.rivervalleytechnologies.com/journal/gx/download-files?YXJ0aWNsZT0zODcmZmlsZT0xNTAmdHlwZT1nZW5lcmljJnZpZXc9ZmFsc2U~

      Reviewer 2. Haifei Hu

      Fuller et al. conducted an interesting work on the Vicia villosa genome study, which could be beneficial for the science community. However, there are some concerns about this work before it can be published.

      1. Introduction The MS seems to indicate the V. villosa genome is important for breeding, and that it is an ideal legume that can grow in winter. But the subsequent analysis and results do not address this. The authors should include additional analysis, at least in the gene annotation section, to indicate what genes are potentially associated with the improvement of genetic-based selection and the ability to grow in winter conditions. After reading the MS, it looks like it mainly focuses on the comparison of the V. villosa genome and the V. sativa genome. Please indicate why it is important to do so and provide more background on V. sativa in the introduction. Line 59. It is too sudden to start describing high heterozygosity as a remaining challenge without directly linking it to V. villosa. The authors need to first include the background that V. villosa is heterozygous, then talk about how challenging it is to generate an assembly.

      2. Methods Line 112: Why is the estimation based on K-mer size quite different from the generated assembly size? The authors’ explanation is weak; these unexpected results need a better, more in-depth explanation. Did you see any similar observations in other studies? Please give examples (citations). Line 121: Any reason not to use the commonly used HiFi assembler hifiasm? Line 142-143: Did you have a file to record which genome regions you have introduced the breaks into, and how this step was performed? Line 158: the unit bp should be changed to Mb for better comparison. Line 160: Here, you should use contig N50 rather than scaffold N50 to indicate the quality of the genome. And you need to compare the contig N50 with that of V. sativa.

      3. DATA VALIDATION AND QUALITY CONTROL BUSCO and LAI analyses should be performed and reported in the main text to assess the quality of the genome.

      4. Phylogenetic tree construction Soybean is an important legume species, and including it would make this result more useful and interesting for readers. You should include the Wm82 V4 genome for this analysis. And the version of other legume species’ genomes needs to be indicated.

      5. Figures Figure 3: The Hi-C alignment map shows that nearly 600 Mb of the genome cannot be scaffolded. Any reason? What is the green dot in the figure? Figure 4b: the BUSCO score of Vvill1.0 is much higher than that of V. sativa. Any reason? And there is no description in the main text of how you performed the BUSCO analysis. Figure 6: Circle plot, would it be possible to rename the scaffolds as chromosomes based on the alignment between V. sativa and V. villosa?
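
      To illustrate the contig N50 versus scaffold N50 point raised under Methods (Line 160), here is a minimal sketch. It assumes contigs can be recovered by splitting scaffolds at runs of at least 10 N characters; that gap threshold, the function names, and the toy sequences are assumptions for illustration, not the authors' procedure.

```python
import re


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def split_into_contigs(scaffold, min_gap=10):
    """Split a scaffold at runs of >= min_gap N characters (assumed gap marker)."""
    pieces = re.split(r"[Nn]{%d,}" % min_gap, scaffold)
    return [piece for piece in pieces if piece]


# Toy assembly: one scaffold made of two contigs joined by a 100 bp gap,
# plus one short ungapped scaffold.
scaffolds = ["ACGT" * 25 + "N" * 100 + "ACGT" * 10, "ACGT" * 5]

scaffold_n50 = n50([len(s) for s in scaffolds])
contig_n50 = n50([len(c) for s in scaffolds for c in split_into_contigs(s)])
print(scaffold_n50, contig_n50)  # prints "240 100"
```

      Because scaffolding joins contigs across gaps, scaffold N50 can be much larger than contig N50, which is why the latter says more about the contiguity of the underlying sequence.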

    1. Editors Assessment: Arbovirus epidemics spread by Aedes mosquitoes (e.g. Chikungunya, dengue, West Nile, Yellow Fever, and Zika) are a growing threat in Africa, but a lack of vector data limits our ability to understand their propagation dynamics. This work describes the geographical distribution of Ae. aegypti and Ae. albopictus in Kinshasa, Democratic Republic of Congo between 2020 and 2022, sharing 6,943 observations under a CC0 waiver as a Darwin Core archive in the University of Kinshasa GBIF database. Review improved the metadata by adding more accurate date information, and this data can provide important information for further basic and advanced studies on the ecology and phenology of these vectors in West Africa.

      This evaluation refers to version 1 of the preprint

    2. Abstract. Arbovirus epidemics (e.g. Chikungunya, dengue, West Nile, Yellow Fever, and Zika) are a growing threat in Africa in areas where Aedes (Ae.) aegypti and Ae. albopictus are present. The lack of complete sampling of these two vectors limits our ability to understand their propagation dynamics in areas at risk from arboviruses. Here, we describe for the first time the geographical distribution of two arbovirus vectors (Ae. aegypti and Ae. albopictus) in a chikungunya post-epidemic zone in the provincial city of Kinshasa, Democratic Republic of Congo between 2020 and 2022. In total 6,943 observations were reported using larval capture and human capture on landing methods. These data are published in the public domain as a Darwin Core archive in the Global Biodiversity Information Facility. The results of this study potentially provide important information for further basic and advanced studies on the ecology and phenology of these vectors, as well as on vector dynamics after an epidemic period. Subject Areas: Ecology, Biodiversity, Taxonomy

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.98), and has published the reviews under the same license. These are as follows.

      **Reviewer 1. Luis Acuña-Cantillo**

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      The authors must review the standard Darwin Core format for sampling events: https://www.gbif.org/darwin-core.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. They don't describe how the map of the study area was created, whether they used a GIS or not. Sampling points must be included on the map.

      They don't mention how the identification of the larval stages was carried out and how they were differentiated from other genera of the Culicinae subfamily, such as Culex, Haemagogus, Mansonia, Sabethes, or from other species of the genus Aedes, since the two main species of this genus were the objective of the study.

      Reference 5, which they mention, covers only adult identification. They should include or cite the collection protocols and describe them as much as possible so that the study can be replicated in other African countries.

      Is there sufficient data validation and statistical analyses of data quality?

      Not my area of expertise. The data could be validated with biological collection of specimens

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      The scientific names must follow a consistent nomenclature: the full name (Aedes aegypti) the first time a species is mentioned and the abbreviated form (Ae. aegypti) thereafter. If there are two species within the same genus, the genus is written in full only at the first mention, and afterwards both are abbreviated: Ae. aegypti and Ae. albopictus.

      Bibliographic references should be cited accordingly, for example: (1-4).

      The names of the diseases must be written consistently, either capitalized or all in lower case: Chikungunya or chikungunya.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      From the description of the study and the collection times, I believe that it fits better with Sampling Events. The data is well organized; however, it is suggested to review the Darwin Core template for this type of data and adjust to the corresponding model (event core): https://www.gbif.org/darwin-core.

      Additional Comments: The data paper can be published with suggestions for improvement. Congratulations, very good job!

      **Reviewer 2. Mary Ann Tuli**

      See the data audit file for more:

      https://gigabyte-review.rivervalleytechnologies.com/journal/gx/download-files?YXJ0aWNsZT00NjQmZmlsZT0xNzYmdHlwZT1nZW5lcmljJnZpZXc9dHJ1ZQ~~

      **Reviewer 3. Paul Taconet**

      Is the language of sufficient quality?

      Yes. Some minor changes that I recommend : "And the relative annual average humidity is 79%." may be changed to "The relative annual average humidity is 79%.". "Aedes albopictus is the most abundant species in the studied region" may be changed to "Aedes albopictus was the most abundant species in the studied region"

      Are all data available and do they match the descriptions in the paper?

      No.

      1/The data available are of type 'occurrence' (only in 1 file - the "occurrence" file). For a better presentation of the data, I would suggest to transform them into "sampling event" data, which is more suited to this kind of data acquired from sampling events (see https://ipt.gbif.org/manual/en/ipt/latest/sampling-event-data), while keeping the occurrence dataset. This would enable the user to quickly understand the dates and locations of the sampling events.

      2/ In the data, the only available date (column eventDate) is the first of January (e.g. 2021-01-01T00:00:00). This does not make it possible to separate the data into seasons (Rainy and Dry) as presented in table 1 of the manuscript. I strongly suggest that the authors provide the specific date for each collected mosquito in the data.
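
      The kind of check behind this comment can be sketched as follows. The snippet counts records whose eventDate is exactly 1 January at midnight, which would suggest a year-only placeholder rather than a real collection date. The sample rows and the function name are hypothetical; only the eventDate column name comes from the dataset described above.

```python
import csv
import io
from datetime import datetime


def count_placeholder_dates(rows):
    """Return (placeholder_count, total_count) for an iterable of dict rows.

    A row counts as a placeholder when its eventDate is 1 January, 00:00:00.
    """
    placeholder = 0
    total = 0
    for row in rows:
        stamp = datetime.fromisoformat(row["eventDate"])
        total += 1
        if (stamp.month, stamp.day, stamp.hour, stamp.minute, stamp.second) == (1, 1, 0, 0, 0):
            placeholder += 1
    return placeholder, total


# Hypothetical occurrence records, not taken from the published dataset.
sample = """occurrenceID,scientificName,eventDate
1,Aedes aegypti,2021-01-01T00:00:00
2,Aedes albopictus,2021-01-01T00:00:00
3,Aedes albopictus,2021-06-14T00:00:00
"""

n_placeholder, n_total = count_placeholder_dates(csv.DictReader(io.StringIO(sample)))
print(f"{n_placeholder} of {n_total} records carry a 1 January placeholder date")
```

      If nearly every record is flagged, the dates cannot support a split into rainy and dry seasons, which is the concern raised here.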

      Is the data acquisition clear, complete and methodologically sound?

      No. 1/Larval collections : sampling strategy used ? 2/How many collection rounds in total ? please provide the dates of collection.

      Is there sufficient data validation and statistical analyses of data quality?

      No. 1/Human landing catch : was any quality control done during the collection of data (i.e. check that the collectors were at their place, etc.) ?

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      Yes. 1/comments for figure 1 (map) : - "legend" should be written in english (and not in french) - "harvesting sites" -> entomological collection points - the background layer is not very appropriate. Maybe better to put an Open Street Map background layer

      2/What about ethical approval for the Human Landing Catches ? please provide the name of the institution who has approved the HLC and the approval number, if relevant

      3/ in the dataset, for the species scientific name, I suggest to use the names as presented in : Harbach, R.E. 2013. Mosquito Taxonomic Inventory, https://mosquito-taxonomic-inventory.myspecies.info/ . Or at least, to provide the "nameAccordingTo" column.

      4/ In the dataset, many columns seem totally empty. Please remove them if so.

      Additional Comments: Thanks for this nice work and the effort put to publish your entomological data. I strongly suggest you to add the real dates of collection of the data in the GBIF dataset (see comments above).

      **Reviewer 4. Angeliki Martinou**

      Are all data available and do they match the descriptions in the paper?

      Yes. It would be good for the authors, the first time that they cite the two species, to use the full names Aedes (Stegomyia) albopictus (Skuse) and Aedes (Stegomyia) aegypti (Linnaeus, 1762).

      In the methods section the title should be Human Landing Catches and not Human capture on landing

    1. Background: Genotyping-by-Sequencing (GBS) provides affordable methods for genotyping hundreds of individuals using millions of markers. However, this challenges bioinformatic procedures that must overcome possible artifacts such as the bias generated by PCR duplicates and sequencing errors. Genotyping errors lead to data that deviate from what is expected from regular meiosis. This, in turn, leads to difficulties in grouping and ordering markers resulting in inflated and incorrect linkage maps. Therefore, genotyping errors can be easily detected by linkage map quality evaluations. Results: We developed and used the Reads2Map workflow to build linkage maps with simulated and empirical GBS data of diploid outcrossing populations. The workflows run GATK, Stacks, TASSEL, and Freebayes for SNP calling and updog, polyRAD, and SuperMASSA for genotype calling, and OneMap and GUSMap to build linkage maps. Using simulated data, we observed which genotype call software fails in identifying common errors in GBS sequencing data and proposed specific filters to better handle them. We tested whether it is possible to overcome errors in a linkage map using genotype probabilities from each software or global error rates to estimate genetic distances with an updated version of OneMap. We also evaluated the impact of segregation distortion, contaminant samples, and haplotype-based multiallelic markers in the final linkage maps. Through our evaluations, we observed that some of the approaches produce different results depending on the dataset (dataset-dependent) and others produce consistent advantageous results among them (dataset-independent). Conclusions: We set as default in the Reads2Map workflows the approaches that proved to be dataset-independent for GBS datasets according to our results. This reduces the number of tests required to identify optimal pipelines and parameters for other empirical datasets. Using Reads2Map, users can select the pipeline and parameters that best fit their data context. The Reads2MapApp shiny app provides a graphical representation of the results to facilitate their interpretation.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad092), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license:

      Reviewer Name: Ramil Mauleon

      The paper titled "Developing best practices for genotyping-by-sequencing analysis using linkage maps as benchmarks" aims to present an end-to-end workflow that uses GBS genotyping datasets to generate genetic linkage maps. This is a valuable tool for geneticists intending to generate a high confidence linkage map from a mapping population with GBS data as input. I got confused on reading the MS though: is this a workflow paper, or is this a review of the component software for each step of genetic mapping and how parameter/use differences affect the output? If it's a review, then the choice of software reviewed is not comprehensive enough, esp. on SNP calling and linkage mapping. There is no clear justification why each component software was used, for example the use of GATK and freebayes for SNP calling. I am familiar with using TASSEL GBS and STACKS for SNP calling using GBS data; why weren't they included in the SNP calling software? The MS would benefit greatly from including these SNP calling software in their benchmarking. Onemap and gusmap also seem pre-selected for linkage mapping, without reason for use, or maybe the reason(s) were not highlighted in the text. I've had experience in the venerable MAPMAKER and MSTMap, and would like to see more comparisons of the chosen genetic linkage mapping software with others, if this is the intent of the MS. The MS also clearly focuses on genetic linkage mapping using GBS, which should be more explicitly stated in the title. GBS is also extensively used in diversity collections and there is scant mention of this in the MS, and whether the workflow could be adapted to such populations. Versions of software used in the workflow are also not explicitly stated within the MS. The shiny app is also not demonstrated well in the MS; it could be presented better with screenshots of the interface, with one or two sample use cases.

    2. Background: Genotyping-by-Sequencing (GBS) provides affordable methods for genotyping hundreds of individuals using millions of markers. However, this challenges bioinformatic procedures that must overcome possible artifacts such as the bias generated by PCR duplicates and sequencing errors. Genotyping errors lead to data that deviate from what is expected from regular meiosis. This, in turn, leads to difficulties in grouping and ordering markers resulting in inflated and incorrect linkage maps. Therefore, genotyping errors can be easily detected by linkage map quality evaluations. Results: We developed and used the Reads2Map workflow to build linkage maps with simulated and empirical GBS data of diploid outcrossing populations. The workflows run GATK, Stacks, TASSEL, and Freebayes for SNP calling and updog, polyRAD, and SuperMASSA for genotype calling, and OneMap and GUSMap to build linkage maps. Using simulated data, we observed which genotype call software fails in identifying common errors in GBS sequencing data and proposed specific filters to better handle them. We tested whether it is possible to overcome errors in a linkage map using genotype probabilities from each software or global error rates to estimate genetic distances with an updated version of OneMap. We also evaluated the impact of segregation distortion, contaminant samples, and haplotype-based multiallelic markers in the final linkage maps. Through our evaluations, we observed that some of the approaches produce different results depending on the dataset (dataset-dependent) and others produce consistent advantageous results among them (dataset-independent). Conclusions: We set as default in the Reads2Map workflows the approaches that proved to be dataset-independent for GBS datasets according to our results. This reduces the number of tests required to identify optimal pipelines and parameters for other empirical datasets. Using Reads2Map, users can select the pipeline and parameters that best fit their data context. The Reads2MapApp shiny app provides a graphical representation of the results to facilitate their interpretation.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad092), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license:

      Reviewer name: Peter M. Bourke

      I read with interest the manuscript on Reads2Map, a really impressive amount of work went into this and I congratulate the authors on it. However, it is precisely this almost excessive amount of results that for me was the major drawback with this paper. I got lost in all the detail, and therefore I have suggested a Major Revision to reflect that I think the paper could be somehow made more streamlined with a clearer central message and fewer figures in the text. Line numbers would have been helpful, I have tried to give the best indication of page number and position, but in future @GigaScience please stick to line numbers for reviewers, it's a pain in the neck without them.

      Overall I think this is an excellent manuscript of general interest to anyone working in genomics, and definitely worthy of publication. Here are my more detailed comments:

      General comment: if a user would like to use GBS data for other population types than those amenable for linkage mapping (e.g. GWAS or genomic prediction, so a diversity panel or a breeding panel), how could your tool be useful for them?

      Other general comment: the manuscript is long with an exhaustive amount of figures and supplementary materials. Does it really need to be this detailed? It appears like the authors lost the run of themselves a little bit and tried to cram everything in, and in doing so risk losing the point of the endeavour. What is the central message of this manuscript? Regarding the figures, the reader cannot refer to the figures easily as they are now mainly contained on another page. Do you really need Figures 16-18 for example?

      Figures 13 and 14 could be combined perhaps? I am sure that at most 10 figures and maybe even less are needed in the main text, otherwise figures will always be on different pages and hence lose their impact in the text call-out.

      Abstract and page 4: "global error rate of 0.05" - How do you motivate the use of a global error rate of 5%? Surely this is dataset-dependent?

      Page 4 - how can a user estimate an error per marker per individual? The description of the create_probs function suggests there is an automatic methodology to do this, but I don't see it described. You could perhaps refer to Zheng et al's software polyOrigin, which actually locally optimises the error prior per datapoint. Maybe something for the discussion.

      Page 6 "recombination fraction giving the genomic order" do you mean "given"?

      Page 10 section Effects of contaminant samples - if you look at Figure 9 you can see that the presence of contaminant samples seems to have an impact on the genotypes of other, non-contaminant samples, especially using GATK and 5% global error. With the contaminants present, the number of XO points decreases in many other samples. This is very odd behaviour I would have thought. Is it known whether this apparent suppression of recombination breakpoints in non-contaminant individuals is likely to be "correct"? Perhaps the SNP caller was running under the assumption that all individuals were part of the same F1? If the SNP caller was run without this assumption (e.g. specifying only HW equilibrium, or model-free) would we still see the same effect? This is for me a quite worrying result but something that you make no reference to as far as I can tell.

      Page 12 "Effects of segregation distortion" In your study you only considered a single linkage group. One of the primary issues with segregation distortion in mapping is that it can lead to linkage disequilibrium between chromosomes, if selection has occurred on multiple loci. This can then lead to false linkages across linkage groups. Perhaps good to mention this.

      Page 12 "have difficulty missing linkage information" - missing word "with"

      Page 17 I see no mention of the impact of errors in the multi-allelic markers on the efficiency, particularly of order_seq which seems to be very poorly-performing with only bi-allelics (Fig 20). If bi-allelic SNPs have errors then it is not obvious why multi-SNP haplotypes should not also have errors.

      Page 3 Figure 1 - here the workflow shows multiple options for a number of the steps, which can lead to the creation of many map variants (e.g. 816 maps as mentioned on Page 4). Should all users produce 816 variants of their maps? With potentially millions of markers, this is going to take a huge amount of time (most users will want 100% of all chromosomes, not 37% of a single chromosome). Or should this be done for only a subset of markers? What if there is no reference sequence available to select a subset? As there are no clear recommendations, I suspect that the specific combination of pipeline choices will usually be dataset-dependent. You actually mention this in the discussion on page 17. And with only 2 real datasets from 2 different species, there is also no way to tell if e.g. GATK works best in rose, or updog should be used for monocots but not dicots etc. It would be helpful if the authors were more explicit about how their tool informs "best practices for GBS analysis" for ordinary users. Perhaps it is there, but for me this message gets lost.

      Page 17 "updates in this version 3.0 to resolve issues with inflated genetic maps" - if I look at Figure 20, it seems that issues with inflated map length have not yet been fully resolved!

      Page 17 "we provide users tools to select the best approaches" - similar comment as before - does this mean users should build > 800 maps with a subset of their dataset first, and then use this single approach for the whole dataset? It is not explicitly stated whether this is the guidance given. What is the eventual aim - to produce a good linkage map, or to use the linkage map to critically compare genotyping tools?

    3. Background: Genotyping-by-Sequencing (GBS) provides affordable methods for genotyping hundreds of individuals using millions of markers. However, this challenges bioinformatic procedures that must overcome possible artifacts such as the bias generated by PCR duplicates and sequencing errors. Genotyping errors lead to data that deviate from what is expected from regular meiosis. This, in turn, leads to difficulties in grouping and ordering markers resulting in inflated and incorrect linkage maps. Therefore, genotyping errors can be easily detected by linkage map quality evaluations. Results: We developed and used the Reads2Map workflow to build linkage maps with simulated and empirical GBS data of diploid outcrossing populations. The workflows run GATK, Stacks, TASSEL, and Freebayes for SNP calling and updog, polyRAD, and SuperMASSA for genotype calling, and OneMap and GUSMap to build linkage maps. Using simulated data, we observed which genotype call software fails in identifying common errors in GBS sequencing data and proposed specific filters to better handle them. We tested whether it is possible to overcome errors in a linkage map using genotype probabilities from each software or global error rates to estimate genetic distances with an updated version of OneMap. We also evaluated the impact of segregation distortion, contaminant samples, and haplotype-based multiallelic markers in the final linkage maps. Through our evaluations, we observed that some of the approaches produce different results depending on the dataset (dataset-dependent) and others produce consistent advantageous results among them (dataset-independent). Conclusions: We set as default in the Reads2Map workflows the approaches that showed to be dataset-independent for GBS datasets according to our results. This reduces the number required of tests to identify optimal pipelines and parameters for other empirical datasets. 
Using Reads2Map, users can select the pipeline and parameters that best fit their data context. The Reads2MapApp shiny app provides a graphical representation of the results to facilitate their interpretation.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad092), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license:

      **Reviewer name: Zhenbin Hu**

      In this MS, the authors tried to develop a framework for using GBS data for downstream analysis and to reduce the impact of sequence errors caused by GBS. However, sequence error is not an issue specific to GBS; it also affects whole genome sequences. Actually, I think the major issue for GBS is missing data. However, in this MS, the authors did not test the impact of missing data on downstream analysis.

      The authors also mentioned that sequencing error may cause segregation distortion in linkage map construction; however, segregation distortion can also occur with correct genotyping data. It can be caused by selection of individuals during the construction of the population. So I don't think it is correct to use segregation distortion to correct sequence errors.

      The authors need to clarify the major question of this MS: in the abstract, the authors highlight the sequence errors, while in the introduction, the authors highlight the package for linkage map construction (the last paragraph). Actually, from the MS, the authors were assembling a framework for genotyping-by-sequencing data.

      The two major reduced-representation sequencing approaches, GBS and RADseq, have specific tools for genotype calling, such as TASSEL and Stacks. However, the authors used the GATK and Freebayes pipelines for variant calling; the authors need to present the reason they did not use TASSEL and Stacks.

      In genotyping-by-sequencing data, individuals are barcoded and mixed during sequencing. What package/code was used to split the individuals (demultiplex) from the fastq for the GATK and Freebayes pipelines?

      The maximum missing data allowed was 25% for markers; what about the individual missing rate?

      On page 6, the authors mentioned 'sequence size of 350'; what does that mean?

    1. Abstract. Background: Single-cell RNA sequencing (scRNA-seq) provides high-resolution transcriptome data to understand the heterogeneity of cell populations at the single-cell level. The analysis of scRNA-seq data requires the utilization of numerous computational tools. However, non-expert users usually experience installation issues, a lack of critical functionality or batch analysis modes, and the steep learning curves of existing pipelines. Results: We have developed cellsnake, a comprehensive, reproducible, and accessible single-cell data analysis workflow, to overcome these problems. Cellsnake offers advanced features for standard users and facilitates downstream analyses in both R and Python environments. It is also designed for easy integration into existing workflows, allowing for rapid analyses of multiple samples. Conclusion: As an open-source tool, cellsnake is accessible through Bioconda, PyPi, Docker, and GitHub, making it a cost-effective and user-friendly option for researchers. By using cellsnake, researchers can streamline the analysis of scRNA-seq data and gain insights into the complex biology of single cells.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad091), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license:

      **Reviewer name: Qianqian Song**

      This paper offers an open-source tool, i.e., cellsnake, to perform single-cell data analysis. This cellsnake tool offers advanced features for standard users and facilitates downstream analyses in both R and Python environments. It is also designed for easy integration into existing workflows, allowing for rapid analyses of multiple samples. I like the incorporation of metagenome analysis in the design of this tool, which sets it apart from other available tools in single-cell analysis.

      1) I looked through their tutorial, and have a specific question regarding the resolution parameter. Does this resolution argument need to be pre-selected, or can the cellsnake tool automatically select a resolution parameter?

      2) Is it possible to add color legends to the UMAP, rather than labeling all cell types on the plot? It can be very hard to distinguish the cell types, especially when there are many cell types available.

      3) If the single-cell data is profiled from human tissue, is it also possible to use cellsnake to perform microbiome analysis?

      4) I recommend that the authors compare cellsnake with other existing tools. Pros and cons need to be highlighted.

    2. Background: Single-cell RNA sequencing (scRNA-seq) provides high-resolution transcriptome data to understand the heterogeneity of cell populations at the single-cell level. The analysis of scRNA-seq data requires the utilization of numerous computational tools. However, non-expert users usually experience installation issues, a lack of critical functionality or batch analysis modes, and the steep learning curves of existing pipelines. Results: We have developed cellsnake, a comprehensive, reproducible, and accessible single-cell data analysis workflow, to overcome these problems. Cellsnake offers advanced features for standard users and facilitates downstream analyses in both R and Python environments. It is also designed for easy integration into existing workflows, allowing for rapid analyses of multiple samples. Conclusion: As an open-source tool, cellsnake is accessible through Bioconda, PyPi, Docker, and GitHub, making it a cost-effective and user-friendly option for researchers. By using cellsnake, researchers can streamline the analysis of scRNA-seq data and gain insights into the complex biology of single cells.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad091), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license:

      Reviewer name: Tazro Ohta

      The manuscript describes Cellsnake, a user-friendly tool for single-cell RNA sequencing analysis that targets non-expert users in the field of bioinformatics. Cellsnake operates as a command-line application, providing offline analysis capabilities for sensitive data. The integration of popular single-cell RNA-seq analysis software within Cellsnake, as described in Table 1, enhanced its utility as a comprehensive workflow. Cellsnake has different execution options (minimal, standard, and advanced) with varying outputs and execution times. The authors have provided well-structured online documentation, including helpful quick-start examples that facilitated easy understanding and usage of Cellsnake.

      The tool was tested using the Docker appliance and the provided fetal brain dataset and performed as expected. The manuscript explains the functions well, with the results reproduced from existing research using publicly available datasets. The following issues need to be addressed by the authors.

      1. The authors should include the citation for the Snakemake paper to acknowledge its contribution. https://doi.org/10.1093/bioinformatics/bts480

      2. To support the claim of unique features in Cellsnake, a comparison with other similar methods, such as that on Galaxy (https://doi.org/10.1093/gigascience/giaa102), should be included.

      3. It is recommended to host the Docker container image on both the GitHub Container Registry and the Docker Hub for better availability and redundancy. The authors should publish the Dockerfile to enable users to build a container image, if needed.

      4. Online documentation is missing a link to the fetal-liver example dataset (https://cellsnake.readthedocs.io/en/latest/fetalliver.html), which needs to be addressed. The fetalbrain dataset shared via Dropbox should also be deposited in the Zenodo repository to improve accessibility and long-term preservation.

      5. To assist users who want to use Cellsnake as a Snakemake workflow, the tool documentation should provide clear instructions on how to run Cellsnake as a single snakemake pipeline. This would be useful for users who utilize existing workflow platforms to accept snakemake requests.

      6. The benchmarking of Cellsnake must provide more precise specifications than simply referring to "a standard laptop" for computing requirements. My trial of "cellsnake integrated standard" with the fetalbrain dataset took more than 17 h via Docker execution on my M1 Max MacBook Pro. This may be because the provided Docker image is AMD-based, which let my MacBook run the container on a VM, but the recommended computational specifications will help users. The GitHub issue of the Cellsnake repository also mentioned that the software is not tested on Windows Conda, which should be mentioned at least in the online documentation.

      7. In the Data Availability section, please ensure that the correct formatting and consistent identifiers are used for public data, such as replacing SRP129388 with PRJNA429950 and E-MTAB-7407 with PRJEB34784, specifying that these IDs are from the Bioproject database. It is important to mention that EGA files are under controlled access, requiring user permission for retrieval.

      8. The references in the manuscript need to be properly formatted to ensure the inclusion of publication years and DOIs where available.

      9. The help message from the Cellsnake command indicates that its default values are set for human samples. The authors should mention in the manuscript that the pipeline is configured for human samples and requires further configuration for use with samples from other organisms. A step-by-step guide to configuring the settings for other species, including the reference data download, would help the tool reach a wider audience.

    1. Background: In recent years, three-dimensional (3D) spheroid models have become increasingly popular in scientific research as they provide a more physiologically relevant microenvironment that mimics in vivo conditions. The use of 3D spheroid assays has proven to be advantageous as it offers a better understanding of the cellular behavior, drug efficacy, and toxicity as compared to traditional two-dimensional cell culture methods. However, the use of 3D spheroid assays is impeded by the absence of automated and user-friendly tools for spheroid image analysis, which adversely affects the reproducibility and throughput of these assays. Results: To address these issues, we have developed a fully automated, web-based tool called SpheroScan, which uses the deep learning framework called Mask Regions with Convolutional Neural Networks (R-CNN) for image detection and segmentation. To develop a deep learning model that could be applied to spheroid images from a range of experimental conditions, we trained the model using spheroid images captured using IncuCyte Live-Cell Analysis System and a conventional microscope. Performance evaluation of the trained model using validation and test datasets shows promising results. Conclusion: SpheroScan allows for easy analysis of large numbers of images and provides interactive visualization features for a more in-depth understanding of the data. Our tool represents a significant advancement in the analysis of spheroid images and will facilitate the widespread adoption of 3D spheroid models in scientific research. The source code and a detailed tutorial for SpheroScan are available at https://github.com/FunctionalUrology/SpheroScan.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad082), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license:

      **Reviewer name: Francesco Pampaloni**

      This study represents a significant contribution to the field of screening and analysis of three-dimensional cell cultures. The demand for reliable and user-friendly image processing tools to extract quantitative data from a large number of spheroids or other types of three-dimensional tissue models is substantial. The authors of this manuscript have developed a tool that aims to address this need by providing a straightforward method to extract the projected area and intensity of individual cellular spheroids imaged with bright-field microscopy. The tool is compatible with "Incucyte" microscopes or any other automated microscope capable of imaging multiple specimens, typically found in high-density multiwell plates.

      An admirable aspect of this work is the authors' decision to make all the code and pipeline openly available on Github. This openness allows other scientists to test and validate the code, promoting transparency and collaboration in the scientific community. However, several improvements should be made to the manuscript prior to publication.

      One important aspect that the authors should address in the manuscript is the suitability, rationale, and extent of using a neural network-based segmentation approach for the specific analysis described in the manuscript: segmentation of single bright-field images of spheroids.

      While neural networks are anticipated to play an increasingly important role in microscopy data segmentation in the coming years, they are not a universal solution. Although there may be segmentation tasks that are challenging to accomplish with traditional approaches, where neural networks can be highly effective, other segmentation tasks can be successfully performed using conventional strategies. For example, in our research group, we were able to reliably segment densely populated bright-field images containing numerous organoids in a single field of view using a pipeline based on the ImageJ plugin MorphoLibJ (see references: https://doi.org/10.1093/bioinformatics/btw413 and https://doi.org/10.1186/s12915-021-00958-w). Therefore, it would be informative and valuable for readers if the authors compared the results obtained from the neural network with those achieved by employing simple thresholding techniques (such as Otsu or Watershed) on the same dataset, as demonstrated in a similar study (reference: https://doi.org/10.1038/s41598-021-94217-1, Figure 5).

      Furthermore, to address the limitations of the model, the authors should provide specific examples (preferably in the supplementary material due to space constraints) of incorrect segmentations or artifacts that arise from applying the neural network to the data. For instance, it would be beneficial to explore scenarios where spheroids are surrounded by cellular debris or when multiple spheroids are present in the field of view. These real-life situations are common and it is important to provide insights into potential challenges that may arise when the images of the spheroids are not pristine.

    2. Background: In recent years, three-dimensional (3D) spheroid models have become increasingly popular in scientific research as they provide a more physiologically relevant microenvironment that mimics in vivo conditions. The use of 3D spheroid assays has proven to be advantageous as it offers a better understanding of the cellular behavior, drug efficacy, and toxicity as compared to traditional two-dimensional cell culture methods. However, the use of 3D spheroid assays is impeded by the absence of automated and user-friendly tools for spheroid image analysis, which adversely affects the reproducibility and throughput of these assays. Results: To address these issues, we have developed a fully automated, web-based tool called SpheroScan, which uses the deep learning framework called Mask Regions with Convolutional Neural Networks (R-CNN) for image detection and segmentation. To develop a deep learning model that could be applied to spheroid images from a range of experimental conditions, we trained the model using spheroid images captured using IncuCyte Live-Cell Analysis System and a conventional microscope. Performance evaluation of the trained model using validation and test datasets shows promising results. Conclusion: SpheroScan allows for easy analysis of large numbers of images and provides interactive visualization features for a more in-depth understanding of the data. Our tool represents a significant advancement in the analysis of spheroid images and will facilitate the widespread adoption of 3D spheroid models in scientific research. The source code and a detailed tutorial for SpheroScan are available at https://github.com/FunctionalUrology/SpheroScan

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad082), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license:

      **Reviewer name: Kevin Tröndle**

      The authors present a "Technical Note" about an open-source web tool called SpheroScan. As input, users can upload (large batches of) spheroid images (brightfield, 2D). The tool delivers two outputs: (1) Prediction Module: creates a file with the area and intensity of detected spheroids (CSV); (2) Visualization Module: plots of the corresponding parameters (PNG). Performance was tested on 480 Incucyte images and 423 microscope images, with 336 (70 %) and 265 for training, 144 (30 %) and 117 for validation, and 50 images for testing, respectively. The framework is based on Mask R-CNN and the Detectron2 library. The performance was tested in the range of 0.5 to 0.95 against manual annotation (VGG Annotator). As the evaluation measure, they used Intersection over Union (IoU), which determines the overlap between the predicted and ground truth regions, and calculated Average Precision (AP) values for masking: 0.937 and 0.972 (test), 0.927 and 0.97 (validation), as well as AP for bounding boxes: 0.899 and 0.977 (test), 0.89 and 0.944 (validation). They show a linear runtime, demonstrated with datasets of different sizes (1 s / image) for masking on a 16-core CPU, 64 GB RAM machine. The tool is available on GitHub and claimed to be available as a web tool on spheroscan.onrender.com.

      General evaluation:

      The concept of the tool serves some important needs of 3D cell culture-based assays: automated, standardized, high-throughput image analysis. As such, it represents value added for the research field.
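
      For readers unfamiliar with the IoU measure named in the summary above, the following is a minimal sketch of how the overlap between a predicted and a ground-truth mask can be scored and checked against thresholds from 0.5 to 0.95. It is not SpheroScan's or Detectron2's code; the function names, the toy masks, and the reading of "0.5 to 0.95" as IoU thresholds in steps of 0.05 are assumptions made for illustration.

```python
def mask_iou(pred, truth):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = pred | truth
    if not union:
        # Both masks are empty: IoU is undefined, 0.0 is returned by convention here.
        return 0.0
    return len(pred & truth) / len(union)


def square(r0, r1, c0, c1):
    """Hypothetical helper: a filled rectangle as a set of pixels."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


# Toy masks (assumed data): two 6x6 squares shifted by one pixel.
truth = square(2, 8, 2, 8)
pred = square(3, 9, 3, 9)

iou = mask_iou(pred, truth)

# Assumed threshold grid: 0.50, 0.55, ..., 0.95.
thresholds = [0.50 + 0.05 * k for k in range(10)]
passed = [t for t in thresholds if iou >= t]

print(f"IoU = {iou:.3f}; counted as a match at {len(passed)} of {len(thresholds)} thresholds")
```

      A detection counts as correct at a given threshold only if its IoU reaches that threshold, which is why AP values averaged over stricter thresholds are lower than AP at 0.5 alone.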

      However, it remains open how high the impact, the reproducibility, and the chances of potential application by other researchers will be. This is due to some significant limitations in accessibility (i.e. non-permanent or non-functional web tool), as well as the (potential) restriction of input data (i.e. brightfield only, not validated with external data) and the limited options for analysis of the metadata (i.e. area and intensity only). The greatest value stems from the possibility to access a web interface, which is easy to use and will ideally be equipped with additional functionalities in the future.

      Comment 1 (minor): The presented tool uses the Mask R-CNN deep-learning model in its image processing pipeline. Several tools that perform image segmentation based on this or other models are well established and already implemented in several commercial imaging devices. They allow for segmentation of cell-containing image areas, e.g. to determine confluency or cell migration in "wound healing assays", and are mainly optimized for 2D cultures but also applicable to 2D images of 3D spheroids. The concept of automated image segmentation is thus not novel and only meets the journal's input criterion as an "update or adaptation of existing" tools.

      The state of the art and preliminary work are not sufficiently referenced. Several similar and alternative (open-source) tools exist and should be mentioned in the manuscript, e.g. (Lacalle et al., 2021; Piccinini et al., 2023; Trossbach et al., 2023), to give only a few examples.

      Comment 2 (major): The authors claim to present a user-friendly open-source web tool. The Python project is available on Github, and on a demo server (https://spheroscan.onrender.com/) where the web interface can be accessed. Unfortunately, the mentioned web tool is not functional, i.e. it is stated on the website: "This is a demonstration server and the prediction module is not available for use. To utilize the prediction functionality, please run SpheroScan on your local machine.".

      This significantly limits the applicability of the presented tool to users who are able to execute Python code on their local hardware. Therefore, the demo server should either present a functional user interface (recommended), or the statement should be removed from the manuscript, which would limit the impact of the submission significantly.

      Comment 3 (major): The presented algorithm was trained exclusively on internal data of brightfield images from "Incucyte and microscope platforms". Furthermore, two distinct models were generated, working with either Incucyte or microscope images.

      It remains unclear how the algorithm will perform on external data of prospective users. The fact that two distinct models had to be trained for different image sources (i.e. from two different platforms) indicates a limited robustness of the models in this regard. This is clearly a general problem of image processing algorithms, but one that will stand in the way of applicability by external users, who will certainly use other imaging techniques. Since the web tool interface is not functional at this point, the authors will also not be able to evaluate or improve on this after publication. At least one performance test with external data, ideally obtained from a blinded user, should be performed to further elaborate on this.

      Comment 4 (major): Many assays nowadays use fluorescent labels, for example to calculate cell ratios within 3D arrangements, e.g. for cell viability or the expression of certain proteins. The authors do not state if the algorithm (or future iterations thereof) is or will be able to process multi-channel microscope images of spheroids.

      This is a significant limitation of the presented work and should at least be mentioned in the corresponding section. Furthermore, a proof-of-concept test run with fluorescent images could be performed to test the algorithm performance and derive potentially necessary adaptations in future versions.

      Comment 5 (minor): The output of the tool is a list of detected spheroids with corresponding area (2D) and bright field average intensity within the area.

      The usability of these two parameters is limited to specific assays, such as the mentioned use case to investigate collagen gel contraction assays. Several other parameters of interest could easily be derived from the metadata, such as roundness, volume estimation (assuming a spheroid shape), or even cell count estimation. This should again be mentioned in the "limitations and considerations" section.
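
      As an illustration of the derived parameters suggested in Comment 5, the sketch below computes a roundness score and a volume estimate from a projected area. It assumes a perfectly spherical shape for the volume, uses the common circularity definition 4πA/P² for roundness, and assumes that a perimeter is available from the segmentation mask; none of these functions or inputs are part of SpheroScan's reported output, which contains area and intensity only.

```python
import math


def equivalent_radius(area):
    """Radius of the circle that has the same area as the projected spheroid."""
    return math.sqrt(area / math.pi)


def sphere_volume_from_area(area):
    """Volume estimate, assuming the spheroid is a perfect sphere."""
    r = equivalent_radius(area)
    return 4.0 / 3.0 * math.pi * r ** 3


def circularity(area, perimeter):
    """Roundness as 4*pi*A / P^2: 1.0 for a perfect circle, lower otherwise."""
    return 4.0 * math.pi * area / perimeter ** 2


# Hypothetical measurement: a circular outline with a radius of 100 pixels.
area = math.pi * 100.0 ** 2
perimeter = 2.0 * math.pi * 100.0

print(f"equivalent radius: {equivalent_radius(area):.1f} px")
print(f"circularity:       {circularity(area, perimeter):.3f}")
print(f"volume estimate:   {sphere_volume_from_area(area):.0f} px^3")
```

      Converting pixel units to physical units would additionally require the image scale, which is not considered here.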

    1. Abstract. The adoption of whole genome sequencing in genetic screens has facilitated the detection of genetic variation in the intronic regions of genes, far from annotated splice sites. However, selecting an appropriate computational tool to differentiate functionally relevant genetic variants from those with no effect is challenging, particularly for deep intronic regions where independent benchmarks are scarce. In this study, we have provided an overview of the computational methods available and the extent to which they can be used to analyze deep intronic variation. We leveraged diverse datasets to extensively evaluate tool performance across different intronic regions, distinguishing between variants that are expected to disrupt splicing through different molecular mechanisms. Notably, we compared the performance of SpliceAI, a widely used sequence-based deep learning model, with that of more recent methods that extend its original implementation. We observed considerable differences in tool performance depending on the region considered, with variants generating cryptic splice sites being better predicted than those that affect splicing regulatory elements or the branchpoint region. Finally, we devised a novel quantitative assessment of tool interpretability and found that tools providing mechanistic explanations of their predictions are often correct with respect to the ground truth information, but the use of these tools results in decreased predictive power when compared to black box methods. Our findings translate into practical recommendations for tool usage and provide a reference framework for applying prediction tools in deep intronic regions, enabling more informed decision-making by practitioners.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad085), which carries out open, named peer-review. The review is published under a CC-BY 4.0 license:

      Reviewer name: Raphael Leman

      Summary: In this work, Barbosa et al. presented a benchmarking of several splicing predictors for human intronic variants. Overall, the results of this study showed that deep learning based tools such as SpliceAI outperformed the other splicing predictors in detecting splice-disrupting variants and thus pathogenic variants.

      The authors also detailed the performance of these tools on several subsets of data according to the collection origin of the variants and according to their genomic localization. This work is one of the first large and independent studies of splicing prediction performance among intronic variants, and in particular among deep intronic variants, in a context of molecular diagnosis. This work also highlights the need for reliable prediction tools for these variants and that the splicing impact of these variants is often underestimated. However, I consider that major points should be resolved before the article is considered for publication.

      **Major points**

      1 The most important point is that the authors show results in the main text but, in the following paragraphs, claim that these results are biased. In addition, the results that take these biases into account are only shown in the supplementary data, and readers must make the correction themselves to get the "true" results. Indeed, the interpretation changes drastically between the biased results and the "true" results. The two main biases were: i) the use of ClinVar data already used for the training of CAPICE (see my comment n°2), ii) the intronic tags of variants and the relative distance to the nearest splice site were wrong (see my comment n°5). Consequently, the authors should remove these biased results and only show results after bias correction.

      2 Importantly, several tools used ClinVar variants or published data to train and/or validate their models. Therefore, to perform a benchmark on a truly independent collection of variants, the authors should ensure that there is no overlap between the variants used for tool development and those used in the present study.

      3 As the authors show by comparing the ClinVar classification (N = 54,117 variants) with the impact on RNA from in vitro studies (N = 162 variants), there were discrepancies between these two sources of information (N = 13/74 common variants, 18%). Consequently, using the ClinVar classification to assess the performance of splicing prediction tools is not optimal. To partially address this point, I think it could be interesting to study further (e.g., minor allele frequency, availability of in vitro RNA studies, …) the intronic variants with positive splicing predictions from two or more tools and a benign or likely benign ClinVar classification and, conversely, the intronic variants with negative splicing predictions from two or more tools and a pathogenic or likely pathogenic ClinVar classification.

      4 The authors used pre-computed databases for 19 tools, but most of these databases do not include small indels. This artificially adds missing data to the disadvantage of the tool, although the same tool could score these indel variants de novo.

      5 The authors said that "We hypothesized that variability in transcript structures could be the reason [increase in performance in the deepest intronic bins]: despite these variants being assigned as occurring very deep within introns (> 500bp from the splice site of the canonical isoform) in the reference isoform, they may be exonic or near-splice site variants of other isoforms of the associated gene". To address this transcript structure variability, the authors could first use a weighted relative distance, as follows: $\left|\,\left|Pos_{\text{nearest splice site}} - Pos_{\text{variant}}\right| - Intron\_Size\,\right| / Intron\_Size$. Secondly, the ClinVar data contain the RefSeq transcript ID on which the variant was annotated (except for large duplications/deletions), so the authors should map these RefSeq transcript IDs to the transcripts used to perform the splicing predictions.

      6 With respect to the six categories of splice-altering variants, it is unclear how the authors considered cases in which variants alter physiological splice motifs (e.g., natural consensus sequences 3'SS/5'SS, branch point, or ESR) but, instead of exon skipping, the spliceosome recruits another distant splice site that is partially or not affected by the variant.

      7 In Table 1, which lists the tools considered for this study, please state explicitly for each tool on which data collections (ClinVar or splice-altering variants) and for which genomic regions the benchmark was done. This information will make the article easier to read.

      8 In line with my comment n°3, not all spliceogenic variants are necessarily pathogenic. The mutant allele could produce aberrant transcripts without a frameshift and without affecting the functional domains of the protein. In addition, transcription could also lead to a mix of aberrant and full-length transcripts. As a result, the main goal of splicing prediction tools is to detect splice-altering variants. Considering variants with positive splicing prediction as pathogenic is a dangerous shortcut and only an in vitro RNA study could confirm the pathogenicity of a variant. The discussion section should be updated accordingly.

      9 The authors claimed that: "The models [SQUIRLS and SPiP] were frequently able to correctly identify the type of splicing alteration, yet they still fail to propose higher-order mechanistic hypotheses for such predictions.". I think that the authors over-interpreted the results (see my comment n° 21-).

      10 The authors recommended prioritizing intronic variants using CAPICE. Is this still true once the bias is corrected (see my comment n°1)?
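
      The weighted relative distance proposed in major point 5 can be sketched in a few lines. This is only an illustration of the formula as the reviewer wrote it; the function name, argument names and example coordinates are hypothetical and do not come from the manuscript.

```python
def weighted_relative_distance(pos_nearest_splice_site: int,
                               pos_variant: int,
                               intron_size: int) -> float:
    """Compute | |pos_nearest_splice_site - pos_variant| - intron_size | / intron_size."""
    if intron_size <= 0:
        raise ValueError("intron_size must be positive")
    # Absolute distance from the variant to its nearest splice site
    distance = abs(pos_nearest_splice_site - pos_variant)
    # Normalise by intron length, following the formula in point 5
    return abs(distance - intron_size) / intron_size


# Hypothetical example: 2,000 bp intron, variant 500 bp from the nearest splice site
example = weighted_relative_distance(10_000, 10_500, 2_000)
# A variant located exactly at the splice site
at_splice_site = weighted_relative_distance(100, 100, 400)
```

      With the formula as written, a variant located at the splice site scores 1.0, and the score decreases as the variant moves deeper into the intron.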

      **Minor points**

      11 In the introduction, the authors could clearly define the canonical splice site regions (AG/GT dinucleotides in 3'SS: -1/-2 and 5'SS: +1/+2) to distinguish them from the consensus splice sites, commonly defined as 3'SS: -12 (or -18)/+2 and 5'SS: -3/+6.

      12 In the introduction, please also add that splice site activation could also be due to the disruption of a silencer motif.

      13 In ref [17], the authors did not say that the enrichment of splicing-related variants within splice site regions was linked to the sequencing of exons and splice sites. They showed that whole genome sequencing increased the diagnostic rate of rare genetic diseases; they did not actually focus on splicing variants. This enrichment was more probably caused by the fact that geneticists mainly studied variants with positive splicing predictions.

      14 In the paragraph 'The prediction tools studied are diverse in methodology and objectives', please add that most prediction tools target consensus splice sites (e.g., MES, SSF, SPiCE, HSF, Adaboost, …).

      15 In the paragraph 'The prediction tools studied are diverse in methodology and objectives', the authors claimed that 'sequence-based deep learning models such as SpliceAI, which do not accept genetic variants as input.' This is wrong, as SpliceAI can accept a VCF file as input.

      16 In the paragraph 'Pathogenic splicing-affecting variants are captured well by deep learning based methods', this is further explained in the Methods section, but I think a sentence explaining that the 243 variants comprise 81 variants described in ref [19] and 162 variants from a new collection would make the article easier to follow.

      17 In the paragraph 'Pathogenic splicing-affecting variants are captured well by deep learning based methods', among the 13 variants incorrectly classified, please detail how many variants were classified as benign and how many as VUS.

      18 Due to the blue gradient, Fig 1C is hard to analyze.

      19 In the paragraph 'Branchpoint-associated variants', the variants reported in ref [79] were studied in a tumoral context, so the observed impact may not be the same in healthy tissue.

      20 In the paragraph 'Exonic-like variants', the authors changed the parameters of the SpliceAI predictions from the original parameters used for the precomputed scores, in order to take into account variants located deep inside the pseudoexon. Please check whether other prediction tools also have user-defined, optimizable parameters that would take these variants into account.

      21 In the paragraph 'Assessing interpretability', the authors observed that non-informative SPiP annotations presented a high score level. This could be explained by the tool reporting a positive prediction without annotation only because the model score was high, without any relation to a particular splicing mechanism.
      22 In the paragraph 'Assessing interpretability', the authors could compare the SpliceAI annotations regarding the abolition/creation of splice sites, and their positions relative to the variants, with the observed effect on RNA.

      23 In the paragraph 'Predicting splicing changes across tissues', by my count the analysis of AbSpliceDNA predictions was done on 89 variants (154 - 65 = 89); if this is correct, please state it clearly in the text.

      24 In the method section, paragraph "ClinVar", for the 13 variants with discordance between the classification and the observed splicing impact, how many confidence stars did they have?

      25 In the method section, paragraph "Disease-causing intronic variants affecting RNA splicing", the authors filtered out variants within 10 bp of the nearest splice site; please explain why.

      26 In the method section, paragraph "Disease-causing intronic variants affecting RNA splicing", the authors used gnomAD variants as the control set; however, their variant frequency threshold is too low (1%). Indeed, some pathogenic variants involved in recessive genetic disorders have a high frequency in the population. A threshold of 5% is more appropriate.

      27 In the method section, paragraph "Variants that affect RNA splicing", the authors should describe how they considered variants leading to multiple aberrant transcripts and variants with a partial effect (i.e., the mutant allele still producing full-length transcript).

      28 In the method section, paragraph "Variants that affect RNA splicing", regarding the six categories defined by the authors: how were indel variants annotated if they overlapped several categories?

      The new splice donor/acceptor categories included only variants creating new AG/GT or variants occurring within the consensus sequences of cryptic splice sites. Within the Donor-downstream category, please distinguish between variants located within [+3; +6] bp (i.e. the consensus sequence) and variants beyond +6 bp. The exonic-like variants could be variants that did not impact ESR motifs (see my comment n°6).

      29 In the method section, paragraph "Variants that affect RNA splicing", the authors selected, for the control datasets, variants generating the CAGGT and GGTAAG motifs. However, this approach leads to an over-enrichment of false positives. Moreover, it could also be interesting to identify, among the variants creating new splice sites or pseudoexons, the presence of a GC donor motif or a U12 minor spliceosome motif (AT/AC), and to assess how well the different splicing tools detect them.

      30 In Fig S3C, scale the gnomAD population frequency as -log₁₀(P) to make the figure more readable.

      31 I saw double spaces several times in the text; please correct them. English is not my native language so I am not the best judge, but some sentences seem syntactically incorrect (ex: "The splicing tools with the smallest and largest performance drop between the splice site bin ("1-2") and the "11-40" bin were Pangolin and TraP, with weighted F1 scores decreasing by 0.334 and 0.793, respectively"). Please have the article proofread by someone who is fluent in English.
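
      The rescaling suggested in minor point 30 is a one-line transformation. A minimal sketch, assuming the allele frequencies are strictly positive; the helper name and the example frequencies are made up for illustration.

```python
import math


def neg_log10(frequency: float) -> float:
    """Return -log10(frequency) for a frequency in the interval (0, 1]."""
    if not 0.0 < frequency <= 1.0:
        raise ValueError("frequency must be in (0, 1]")
    return -math.log10(frequency)


# Hypothetical allele frequencies: 1%, 0.1% and 0.001%
scaled = [neg_log10(f) for f in (0.01, 0.001, 0.00001)]
```

      On this scale rare variants are spread out over several units instead of being compressed near zero. Variants with a frequency of exactly zero would need separate handling.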

    2. The adoption of whole genome sequencing in genetic screens has facilitated the detection of genetic variation in the intronic regions of genes, far from annotated splice sites. However, selecting an appropriate computational tool to differentiate functionally relevant genetic variants from those with no effect is challenging, particularly for deep intronic regions where independent benchmarks are scarce. In this study, we have provided an overview of the computational methods available and the extent to which they can be used to analyze deep intronic variation. We leveraged diverse datasets to extensively evaluate tool performance across different intronic regions, distinguishing between variants that are expected to disrupt splicing through different molecular mechanisms. Notably, we compared the performance of SpliceAI, a widely used sequence-based deep learning model, with that of more recent methods that extend its original implementation. We observed considerable differences in tool performance depending on the region considered, with variants generating cryptic splice sites being better predicted than those that affect splicing regulatory elements or the branchpoint region. Finally, we devised a novel quantitative assessment of tool interpretability and found that tools providing mechanistic explanations of their predictions are often correct with respect to the ground truth information, but the use of these tools results in decreased predictive power when compared to black box methods. Our findings translate into practical recommendations for tool usage and provide a reference framework for applying prediction tools in deep intronic regions, enabling more informed decision-making by practitioners.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad085 ), which carries out open, named peer-review. The review is published under a CC-BY 4.0 license:

      **Reviewer name: Jean-Madeleine de Sainte Agathe**

      This manuscript presents an important and very exhaustive benchmark concerning intronic variant splicing predictors. The focus on deep-intronic variants is highly appreciated as it addresses a very crucial challenge of today's genetics. The authors present the different tools in a very clear and pedagogical way. I should add that this manuscript is pleasant to read. The authors use the average precision score, allowing a refined comparison between tools.

      They give practical recommendations. They emphasize the use of SpliceAI and Pangolin for intronic variants. For branchpoint regions, they recommend Pangolin and LaBranchoR. It should be noted that this study is, to my knowledge, the first independent benchmark of Pangolin, CI-SpliceAI, ConSpliceML, AbSplice-DNA, SQUIRLS, BPHunter, LaBranchoR and SPiP together. Overall, this study is important as it will be very helpful for the interpretation of intronic variants. I hence fully and strongly support its publication. I have several comments that (I think) should be addressed before publication, especially the first point:

      1) I admit that the curation of such large datasets is challenging; however, I failed to find some of the Table S6 variants in the referenced work. Could you kindly point me to the source for the following variants?

      - The variant "1 hg38_156872925 C T NTRK1 ENST00000524377.1:c.851-708C>T pseudoexon_inclusion keegan_2022" is classified as 'affects_splicing'. However, I did not find it in Keegan 2022 (reference 20). In Keegan, Table S1 mentions NTRK1 variants but not c.851-708C>T. For these NTRK1 variants, Keegan et al. refer to another publication, Geng et al. 2018 (PMC6009080), where I cannot find the ENST00000524377.1:c.851-708C>T variant either.
      - Same for "COL4A3 ENST00000396578.3:c.4462+443A>G 2:g.228173078A>G"
      - Same for "ABCA4 ENST00000370225.3:c.1937+435C>G 1:g.94527698G>C"
      - Same for "FECH ENST00000382873.3:c.332+668A>C 18:g.55239810T>G"
      - Concerning "MYBPC3 ENST00000545968.1:c.1224-52G>A 11:g.47364865C>T", I did not find it in pbarbosa as stated, but in another reference which, I think, should be mentioned in this manuscript: https://pubmed.ncbi.nlm.nih.gov/33657327/
      - "BRCA2 ENST00000544455.1:c.8332-13T>G 13:g.32944526T>G" is classified as splicing neutral based on moles-fernández_2021, but it has previously been shown to alter splicing (https://pubmed.ncbi.nlm.nih.gov/31343793/); please clarify.

      If these variants were somehow erroneously included, the authors should reprocess their results with the corrected datasets.

      2) Although it has been done before, the use of gnomAD variants as a source of splicing-neutral variants is questionable. Indeed, it is theoretically possible that such variants truly alter splicing. For example, genuine splicing alterations can result in mild in-frame consequences for the gene products, or splicing alterations can damage non-essential genes. I suggest that the authors either select another list of gnomAD variants located in disease-associated genes, where benign splicing alterations seem less plausible, or discuss this putative limitation in their results.

      3) Table S8: "Variants above 0.05, the optimized SpliceAI threshold for non-canonical intronic splicing variation" Is that a recommendation of this work? Or was it found elsewhere? Please clarify. More generally, this manuscript uses Average Precision scores, but the authors should explain to their non-statistician readers how it relates to the delta scores of each tool (Fig 3C). Indeed, any indication (or even recommendation, but not necessarily) concerning the use of cut-off values would be very appreciated by the geneticist community.

      4) p.3 "If the model is run twice, once with the reference and once with the mutated sequence, it is possible to measure splice site alterations caused by genetic variants." This study makes only use of the delta scores, which have previously been shown to be misleading in some rare cases (PMID 36765386). The authors would be wise to mention this. For example, in Table S3, "ENST00000267622.4:c.5457+81T>A 14(hg19):g.92441435A>T" is predicted by SpliceAI DG=0.16, but as the reference prediction is already at 0.84, this 0.16 is the maximal delta score possible, yielding donor score = 1.

      5) p.12 "Among the tools that predict across whole introns, SQUIRLS and SPiP are the only ones designed to provide some interpretation of the outcome." Concerning the nature of the mis-splicing event, I think the authors should mention SpliceVault, which has been specifically built for this task (pmid 36747048).

      6) p.14: "SpliceAI and Pangolin […]. If usability is a concern and users do not have a large number of predictions to make, SpliceAI is preferred since the Broad Institute has made available a web app for the task" The Broad Institute web app now includes Pangolin (at least for hg38 variants). Please rephrase or delete this sentence.

      7) Concerning complex delins, which are not annotated by the current version of SpliceAI, the authors should give recommendations. For example, the complex delins from Table S9 "hg19_chr7 5354081 GC AT" is correctly predicted by CI-SpliceAI and SpliceAI-visual, both of which allow the annotation of complex delins with the SpliceAI model.

      8) p.8 "Unfortunately, BPHunter only reported the variants predicted to disrupt the BP, rendering the Precision-Recall Curves (PR Curves) analysis impossible." I agree with the authors. However, I think it is sometimes assumed (wrongly?) that all variants left unannotated by BPHunter have BPH_score=0. Maybe the authors could make this explicit, for example by saying that the lack of a prediction cannot be safely equated with a negative prediction.
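
      The concern in point 4 rests on simple arithmetic: because a splice site score is bounded at 1, a gain delta score cannot exceed 1 minus the reference score. The sketch below illustrates this with the numbers quoted for c.5457+81T>A (reference donor score 0.84, delta 0.16). The helper names are hypothetical and this is not SpliceAI's own code.

```python
def max_gain_delta(ref_score: float) -> float:
    """Largest possible gain delta when scores are probabilities capped at 1."""
    return 1.0 - ref_score


def alt_score_from_delta(ref_score: float, delta: float) -> float:
    """Alternate-allele score implied by a reference score and a gain delta."""
    return ref_score + delta


ref_score, delta = 0.84, 0.16  # values quoted in point 4
ceiling = max_gain_delta(ref_score)                 # about 0.16
alt_score = alt_score_from_delta(ref_score, delta)  # about 1.0
fraction_of_ceiling = delta / ceiling               # about 1.0: the delta is saturated
```

      A delta of 0.16 looks modest in isolation, yet here it is the largest value the site could have reached, which is why reading the delta alongside the reference score can change the interpretation.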

    1. Bats harbor various viruses without severe symptoms and act as their natural reservoirs. The tolerance of bats against viral infections is assumed to originate from the uniqueness of their immune system. However, how immune responses vary between primates and bats remains unclear. Here, we characterized differences in the immune responses by peripheral blood mononuclear cells to various pathogenic stimuli between primates (humans, chimpanzees, and macaques) and bats (Egyptian fruit bats) using single-cell RNA sequencing. We show that the induction patterns of key cytosolic DNA/RNA sensors and antiviral genes differed between primates and bats. A novel subset of monocytes induced by pathogenic stimuli specifically in bats was identified. Furthermore, bats robustly respond to DNA virus infection even though major DNA sensors are dampened in bats. Overall, our data suggest that immune responses are substantially different between primates and bats, presumably underlying the difference in viral pathogenicity among the mammalian species tested

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad086 ), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license.

      **Reviewer name: Doreen Ikhuva Lugano**

      This paper gives a good introduction to bats as reservoirs of several viral infections, which studies have shown is due to the uniqueness of their immune system. The authors and others suggest that the bat immune system is dampened, exhibiting tolerance to various viruses. This gives the study a good rationale for studying the bat immune system in comparison with other mammals. They also give a good rationale for using single-cell sequencing: it allows the identification of various cell types and of the differences between them. From their findings, the main conclusion is that differences among host species are more impactful than those among the different stimuli. They also suggest that bats initiate an innate immune response after infection with DNA viruses through an alternative pathway. For example, the induction dynamics of PRRs seem to be different in their dataset. They also suggest this could be due to the presence of species-specific cellular subsets.

      1. Interesting model system and a good comparison of bats with other mammals.

      2. Good technique in using single-cell sequencing, with a clear rationale as to why it was chosen. This advances knowledge on what was already known about the bat immune system, but the species-specific cellular subsets are new.

      3. Interesting technique to go through the bulk transcriptomic data in four species and four conditions. This allowed the most important genes/pathways to be identified.

      4. Good rationale / flow of experiments from one to another.

      5. I liked that they investigated stimuli from different pathogens, including DNA virus, RNA virus and bacteria, and still showed that bats had a different immune response across the different stimuli.

      Minor comments

      1. Do they speculate that this occurs only in Egyptian fruit bats or in all species of bats?

      2. The introduction mentions why they used Egyptian fruit bats, which are a model organism, but more detail could help people outside this field understand exactly why these bats were used. Advantages? Location? Proximity to the various viruses, given that they are mostly found in endemic regions such as Africa, etc.?

      3. Can they include viral load in each species?

      4. It is not clear which scRNA-seq tools were used for data analysis to identify the cell types. Or did they use an already established marker-based database?

    2. Bats harbor various viruses without severe symptoms and act as their natural reservoirs. The tolerance of bats against viral infections is assumed to originate from the uniqueness of their immune system. However, how immune responses vary between primates and bats remains unclear. Here, we characterized differences in the immune responses by peripheral blood mononuclear cells to various pathogenic stimuli between primates (humans, chimpanzees, and macaques) and bats (Egyptian fruit bats) using single-cell RNA sequencing. We show that the induction patterns of key cytosolic DNA/RNA sensors and antiviral genes differed between primates and bats. A novel subset of monocytes induced by pathogenic stimuli specifically in bats was identified. Furthermore, bats robustly respond to DNA virus infection even though major DNA sensors are dampened in bats. Overall, our data suggest that immune responses are substantially different between primates and bats, presumably underlying the difference in viral pathogenicity among the mammalian species tested.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad086 ), which carries out open, named peer-review. This review is published under a CC-BY 4.0 license.

      **Reviewer name: Urs Greber**

      Hirofumi Aso and colleagues provide a manuscript entitled 'Single-cell transcriptome analysis illuminating the characteristics of species specific innate immune responses against viral infections'. The aim was to describe differences in innate immune responses of peripheral blood mononuclear cells (PBMCs) from different primates and bats against various pathogenic stimuli (different viruses and LPS). A major conclusion from the study is that differences in the immune response between primate and bat PBMCs are more pronounced than those between DNA viruses, RNA viruses or LPS, or between the cell types. The topic is of interest, as the immunological basis for how bats appear to be largely disease resistant to some viruses that cause severe infections in humans is not well understood. One notion by others has been that bats have a larger spectrum of interferon (IFN) type I related genes, some of which are expressed constitutively even in unstimulated tissue, and there, trigger the expression of IFN stimulated genes (ISGs). Alongside, enhanced ISG levels may need to be compensated for in bats. Accordingly, bats may exhibit reduced diversity of DNA sensing pathways, as well as absence of a range of proinflammatory cytokines triggered in humans upon encountering acute disease-causing viruses. The study here uses single-cell RNA sequencing (scRNA-seq) analysis and transcript clustering algorithms to explore the profile of different innate immune responses upon viral infections of PBMCs from H sapiens, Chimpanzee, Rhesus macaque, and Egyptian fruit bat. The most commonly referenced cell types were detected in all four species, although naïve CD8+ T cells were not detected in bat PBMCs, which led the authors to focus on B cells, naïve T cells, killer T/NK cells, monocytes, cDCs, and pDCs. The study used three pathogenic stimuli, Herpes simplex virus 1 (HSV1), Sendai virus (SeV), and lipopolysaccharide (LPS).

      Specific comments

      The text is well written, concise, and per se interesting, but I have a few questions for clarification.

      1) Can the authors provide quality and purity control data for the virus inocula to document virus homogeneity? E.g., neither the methods, nor the indicated ref 26 specify if or how HSV1 was purified. Same is true for SeV where the provided ref 34 does not indicate if virus was purified or not. If virus inocula were not purified then it remains unclear to what extent the effects on the PBMCs described in the study here were due to virus or some other component in the inoculum. Conditions using inactivated inoculum might help to clarify this issue.

      2) What was the infection period? Was it the same for all viruses?

      3) Upon stimuli application, there was a notable expansion of B cells and a compression of killer T / NK cells in the bat but not the human samples, as well as a compression of monocytes, the latter observed in all four species. Can the authors comment on this observation?

      4) Lines 78-79: I do not think that TLR9 ought to be classified as a cytosolic DNA sensor. Please clarify.

      5) Line 117: please clarify that the upregulation of proinflammatory cytokines, ISGs and IFNB1 was measured at the level of transcripts not protein.

      6) Line 244: DNA sensors. The authors report that bats responded well to DNA viruses, although some of their DNA sensing pathways (e.g., STING downstream of cGAS, AIM2 or IFI16) were attenuated compared to primates (H sapiens, Chimpanzee, Macaque). They also allude to the dsRNA PRR TLR3. But I am not sure if TLR3 is the only PRR to compensate for attenuated DNA sensing pathways. The authors might want to explicitly discuss whether other RNA sensors, such as RIG-I-like receptors (RIG-I, LGP2, MDA5), were upregulated similarly in bats as in primate cells upon inoculation with HSV1.

      7) Is it known how much TLR3 protein is expressed in bat PBMCs under resting and stimulated conditions? Same question for the DNA and RNA sensor proteins, e.g., cGAS, AIM2 or IFI16, RIG-I, LGP2, MDA5, or effector proteins, such as STING.

      8) Can authors clarify if cGAS is part of the attenuated DNA sensors in the bat samples under study here? And it would be nice to see the attenuated response of DNA sensing pathways in the bat samples, as suspected from the literature, including STING downstream of cGAS, or AIM2 and IFI16.

      9) What are the expression levels of IFN-I and related genes in the bat cells among the different stimuli?

      10) Technical point: where can the raw scRNA-seq data be found?

  10. Oct 2023
    1. Abstract: Evaluating the impact of amino acid variants has been a critical challenge for studying protein function and interpreting genomic data. High-throughput experimental methods like deep mutational scanning (DMS) can measure the effect of large numbers of variants in a target protein, but because DMS studies have not been performed on all proteins, researchers also model DMS data computationally to estimate variant impacts by predictors. In this study, we extended a linear regression-based predictor to explore whether incorporating data from alanine scanning (AS), a widely-used low-throughput mutagenesis method, would improve prediction results. To evaluate our model, we collected 146 AS datasets, mapping to 54 DMS datasets across 22 distinct proteins. We show that improved model performance depends on the compatibility of the DMS and AS assays, and the scale of improvement is closely related to the correlation between DMS and AS results.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad073 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer: Leopold Parts

      Summary Fu et al. explore utilising low-throughput mutational fitness measurements to predict the results of high-throughput deep mutational scanning experiments. They demonstrate that adding alanine scanning results to predictive models improves performance, as long as the alanine scan used a sufficiently similar evaluation approach to a deeper experiment. The findings make intuitive sense, and will be useful for the community to internalize.

      While we have several comments about the methods used, and requests to fortify the claims with more characterization, we do not expect addressing any of them will change the core findings. One can argue that direct application of AS boosted predictions is likely to be limited due to the number of scans available and the speed at which DMS experiments are now being performed, so it would also be useful to discuss the context of these results in the evolution of the field, and we make specific suggestions for this. Regardless, the presented results are a useful demonstration of a more general use case of low-throughput or partial mutagenesis data for improving fitness prediction and imputation.

      Major Comments

      There are many other computational variant effect predictors beyond Envision and DeMaSk. It would be very useful to see how their prediction results compare to some others, particularly the best performing and common models that are also straightforward to download and run (e.g. EVE, ESM1v, SIFT, PolyPhen2). This would be important context to see how impactful the addition of AS data is to DeMaSk/Envision. Please run additional prediction tools for reference of absolute performance; there is no need to incorporate AS data into them.

      Several proteins have a very small number of AS residues (Figure 2), and from our reading of the methods, other residue scores are imputed with the mean AS value for that protein. (As an aside, it would be good to clarify whether this average is across studies or within a study.) If this reading is correct, the majority of residues for each protein will have imputed AS results (e.g. in the case of PTEN, over 90%), which can be problematic for training and prediction. Please clarify whether our interpretation of the imputation approach is correct and, if so, please also provide results for a model trained without imputation, on many fewer residues. If the boosting model has already implemented this, please integrate the Supplementary methods into the main methods, and reference these and the results when describing the imputation approach to avoid such concerns.

      It is not clear how significant/impactful the increases in performance are in figures 4, 5, S4, S5 & S6. Please use a reasonable analytical test, or training data randomization, to evaluate the improvement against a null model.

      There are quite a few proteins with repeated DMS/AS measurements. In our experience these correlate from moderately to very highly. Including multiple highly correlated studies could lead to pseudo-replication and bias the model performance results. Please present a version of the results where the repeats are averaged first, to test whether that bias exists.

      Minor Comments [suggestions only; no analyses required from us]

      A short discussion about the number of available alanine scans, particularly for proteins without DMS results, would help put the work in context. For example, it would be good to know how many proteins would benefit from improved de-novo predictions (e.g. no DMS data) and how many could have improved imputation (incomplete DMS data). Similarly, the rate and cost of DMS data generation is important to understand the utility of their results. I think a short discussion of how useful models of this sort are in practice now and in future would be helpful to the reader. This seems most natural as part of the end of the discussion, but could also fit in the introduction.

      Figure 2 is missing a y-axis label. We also softly suggest a log-scale axis, to not obscure the degree to which some proteins have more residues covered and the proportion of residues covered by AS.

      Figure 3 includes DMS/AS study pairs with at least three alanine substitutions to compare - we think this is a low cut-off, particularly with the regularisation applied. I think something like 10+ would be more informative.

      I think their cross-validation scheme leaves out an entire protein at a time, as opposed to one study each iteration. I agree this is the better way to do it. However, I initially read it as the latter, which would lead to leakage between train/validation data since the same residue would be included in both if a protein had multiple datasets. It might be useful to be more explicit to prevent other readers doing the same.

      L231 In the discussion they mention fitting a model only using studies with a minimum DMS/AS correlation. This occurred to me as well while reading the relevant part of the results. Is there a good reason not to do this? It doesn't seem like a large amount of work and conceptually seems a good way to assess a model that says what a DMS might look like if it had the same selection criteria as a given AS.

      L154 Similarly, a correlation cut-off as well as choosing the most correlated study seems like it would be a fairer comparison in Figure 5. Just because an AS is the most correlated doesn't necessarily mean it is well correlated.

      It would be interesting to see if the improvement results in Figure 7 correlate with substitution matrices (e.g. BLOSUM) or DMS variant fitness correlations (e.g. correlation between A and C, A and D, etc.). Intuitively it feels like they should.

      It would be nice to label the panels in Figure 7. It also seems notable that predicting alanine substitutions is not the most improved - a brief comment on why would be interesting.

      The AS model adds 2x20 parameters to the model for encoding, which is a lot if CCR5 is held out, as there are only a few hundred total independent residues evaluated. While the performance on held-out proteins is a good standard, it would be interesting to evaluate the increase from a model selection perspective (BIC/AIC or similar) if possible.

      L217 The statement doesn't seem logical to me - if such advanced imputation methods were available, surely they would be better used to impute all substitutions than to just model alanine and then use linear regression to model the rest?

      L331-332 The formula used for regularising Spearman's rho makes sense, and can likely be interpreted as a regularizing prior, but we found it hard to understand its provenance and meaning from the reference. A sentence on its content (not just describing that it shrinks estimates) and a more specific reference would be useful for interested readers like ourselves.

      L364 It says correlation results were dropped when only one residue was available, whereas in the figure legends it says results with fewer than three residues were dropped. Notwithstanding thinking three is maybe too low a cut-off, these should be consistent, or clarified slightly if I've misunderstood the meaning.

      It would be nice to have a bit more comment on the purpose of the final supplementary section (Replacing AS data with DMS scores of alanine substitutions) - if you have DMS alanine results it seems likely you will have the other measurements anyway.
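The distinction drawn in the cross-validation comment, holding out a whole protein rather than a single study, can be made concrete with a short sketch. It assumes each data row is tagged with a (protein, study) pair; the function name and the example labels in the usage note are hypothetical and this is not the authors' implementation.

```python
from collections import defaultdict


def leave_one_protein_out(records):
    """Yield (held_out_protein, train_indices, test_indices) per protein.

    `records` is a sequence of (protein, study) pairs, one per data row.
    Every study of the held-out protein goes to the test side, so no residue
    of that protein can appear in both the training and validation data.
    """
    rows_by_protein = defaultdict(list)
    for index, (protein, _study) in enumerate(records):
        rows_by_protein[protein].append(index)
    for protein, test_indices in rows_by_protein.items():
        train_indices = [
            i for i in range(len(records)) if records[i][0] != protein
        ]
        yield protein, train_indices, test_indices
```

For rows tagged `[("PTEN", "study_a"), ("PTEN", "study_b"), ("CCR5", "study_c")]`, holding out PTEN places both PTEN studies in the test set, whereas a leave-one-study-out scheme would leave one PTEN study in training and leak shared residues.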

    2. Abstract

      Evaluating the impact of amino acid variants has been a critical challenge for studying protein function and interpreting genomic data. High-throughput experimental methods like deep mutational scanning (DMS) can measure the effect of large numbers of variants in a target protein, but because DMS studies have not been performed on all proteins, researchers also model DMS data computationally to estimate variant impacts by predictors. In this study, we extended a linear regression-based predictor to explore whether incorporating data from alanine scanning (AS), a widely-used low-throughput mutagenesis method, would improve prediction results. To evaluate our model, we collected 146 AS datasets, mapping to 54 DMS datasets across 22 distinct proteins. We show that improved model performance depends on the compatibility of the DMS and AS assays, and the scale of improvement is closely related to the correlation between DMS and AS results.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad073 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer: Joseph Ng

      This manuscript explored whether low-throughput alanine scanning (AS) experimental data could complement deep mutational scanning (DMS) to classify the impact of amino acid substitutions in a range of protein systems. The analysis partially confirms this hypothesis, in that it only applies when the functional readouts being measured in the two assays are compatible with one another. In my opinion this is an insight that should be highlighted in a publication, and therefore I believe this manuscript deserves to be published. I just wish the authors could clarify and further explore the points below in their manuscript before I recommend acceptance:

      In my opinion the most important bit of data curation is the classification of DMS/AS pairs as high/medium/low etc. compatible, and this is the key towards the authors' insight that assay compatibility is an important determinant of whether signals in the two datasets could be cross-matched for analysis. The criteria behind this classification are listed in Figure S2 but I feel the wording needs to be more specific. For example, in Figure S2, the authors wrote 'Both assays select for similar protein properties and under similar conditions' - what exactly does this mean? What do the authors consider to be 'similar protein properties'? I could not find a more detailed explanation of this in the Methods section. The authors gave reasons in the spreadsheet in Supp. Table 1 for the labels they give to each pair of assays, but I'm still not exactly sure what they consider to be 'similar'. Is there a more specific classification scheme which is more explicit in defining these 'similarities', e.g. by defining a scoring grid explicitly listing the different levels of 'similarities' of measurable properties, e.g. both thermal stability - score of 3; thermal stability vs protein abundance - 2; thermal stability vs cell survival - 1 (or equivalent, I think the key issue is to provide the reader with a clear guide so they can readily assess the compatibility of the datasets by themselves)?

      I would have thought the discrepancy between the DMS and AS scores would differ across different structural regions of the protein, e.g. the discrepancy would be larger in ordered regions compared to disordered ones, as the protein fold would constrain the types of amino acids tolerable within the ordered segment of the protein. Is this the case in the authors' collection of datasets? If so, does the compatibility of assays modulate this discrepancy?

    3. Abstract

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad073 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer: Leopold Parts

      Summary

      Fu et al. explore utilising low-throughput mutational fitness measurements to predict the results of high-throughput deep mutational scanning experiments. They demonstrate that adding alanine scanning results to predictive models improves performance, as long as the alanine scan used a sufficiently similar evaluation approach to a deeper experiment. The findings make intuitive sense, and will be useful for the community to internalize.

      While we have several comments about the methods used, and requests to fortify the claims with more characterization, we do not expect addressing any of them will change the core findings. One can argue that direct application of AS boosted predictions is likely to be limited due to the number of scans available and the speed at which DMS experiments are now being performed, so it would also be useful to discuss the context of these results in the evolution of the field, and we make specific suggestions for this. Regardless, the presented results are a useful demonstration of a more general use case of low-throughput or partial mutagenesis data for improving fitness prediction and imputation.



    1. **Editor's Assessment:**

      Irises, on top of being popular and beautiful ornamental plants, have wider commercial interest due to the many interesting secondary metabolites present in their rhizomes that have value to the fragrance and pharmaceutical industries. Many of these species have large and difficult-to-assemble genomes, and to fill that gap the Dalmatian Iris (Iris pallida Lam.) is sequenced here, using PacBio long-read sequencing and Bionano optical mapping to produce a giant 10 Gbp assembly with a scaffold N50 of 14.34 Mbp. The authors didn’t manage to handle the haplotigs separately or to study the ploidy, but as all of the data is available for reuse, others can explore these questions further. This reference genome should also allow researchers to study the biosynthesis of these secondary metabolites in much greater detail, opening new avenues of investigation for drug discovery and fragrance formulations.

      This evaluation refers to version 1 of the preprint

    2. Irises are perennial plants, representing a large genus with hundreds of species. While cultivated extensively for their ornamental value, commercial interest in irises lies in the secondary metabolites present in their rhizomes. The Dalmatian Iris (Iris pallida Lam.) is an ornamental plant that also produces secondary metabolites with potential value to the fragrance and pharmaceutical industries. In addition to providing base notes for the fragrance industry, iris tissues and extracts possess anti-oxidant, anti-inflammatory, and immunomodulatory effects. However, study of these secondary metabolites has been hampered by a lack of genomic information, instead requiring difficult extraction and analysis techniques. Here, we report the genome sequence of Iris pallida Lam., generated with Pacific Biosciences long-read sequencing, resulting in a 10.04 Gbp assembly with a scaffold N50 of 14.34 Mbp and 91.8% complete BUSCOs. This reference genome will allow researchers to study the biosynthesis of these secondary metabolites in much greater detail, opening new avenues of investigation for drug discovery and fragrance formulations.
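For readers unfamiliar with the scaffold N50 statistic quoted in the abstract, the following minimal sketch shows the usual definition: the length L such that scaffolds of length at least L together contain at least half of the assembled bases. The function name is hypothetical, and the lengths in the usage note are toy values, not taken from the Iris pallida assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

For the toy lengths `[2, 3, 4, 5, 6]` (20 bases in total), the two longest scaffolds already cover 11 bases, so the N50 is 5. Because N50 depends on the total assembly size, retained haplotigs that inflate the total can shift the value.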

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.94), and has published the reviews under the same license. These are as follows.

      **Reviewer 1. Baocai Han**

      Iris pallida Lam., an ornamental plant, produces secondary metabolites with potential value to the fragrance and pharmaceutical industries, while also possessing anti-oxidant, anti-inflammatory, and immunomodulatory effects. The genome assembly of this species could be more helpful in investigation for drug discovery and fragrance formulations.

      I have a number of comments that follow:

      1. Line 10 (page 2): “resulting in a 10.04 Gbp assembly with a scaffold N50 of 14.34 Mbp”. I found that the genome size is 13.49 Gb in Table 2 and line 18 (page 7), due to differing haplotigs in the phased assembly. However, I cannot find how this problem was dealt with. I suggest purging the duplicates from the genome using the Purge_Dups pipeline (Guan D, McCarthy SA, Wood J et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics, 2020; 36(9): 2896–2898).

      2. Line 5 (page 8): Why is the number of complete and duplicated BUSCOs so high? Is it due to issues with the genome assembly or to the presence of a particularly high number of repetitive sequences in the species?

      3. There is no reference or website for many of the software tools and pipelines, e.g. the HybridScaffolding pipeline (line 22, page 5), lima (line 2, page 6) and Exonerate (line 11, page 6).

      4. I suggest uploading the genome annotation file, given that genome annotation has already been performed.

      **Reviewer 2. Kang Zhang**

      Is the language of sufficient quality?

      Yes, though I found several sentences confusing: P2L8 (Is the DNA/RNA extraction particularly difficult for irises?), and P9L5 (wording).

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      Yes. With the following comments.

      1. P7L20. The basic stats of the subreads should be introduced before the assembling process.
      2. The authors should provide more methodological details about the BUSCO assessment, such as the database version, the mode (genome or protein), etc.
      3. I am curious about the genome size enlargement introduced by the scaffolding. Were different haplotigs (from different haplotypes) used for scaffolding, and why? I suppose that only the primary haplotigs should be used.
      4. Considering the high proportion of duplicated BUSCO genes, I wonder whether the iris sequenced is a polyploid or not. Please clarify this in the Background.

      Additional Comments: Dr. Wong and her colleagues reported a genome assembly of iris using the PacBio technology. Due to the huge genome size, the generated data volume is impressive. Although the quality of the assembly is not so satisfying, it is reasonable considering the genome size and the high heterozygosity, which is commonly found in many flowers. Overall, the methods used in this work are well described, and the data could be accessed. I have only several minor points regarding the details of the assembly process.

  11. Sep 2023
    1. **Editor's Assessment:**

      While Bacterial Artificial Chromosome (BAC) libraries were once a key resource for building the human genome project, over time they have been rendered relatively obsolete by long-read technologies. In the era of CRISPR-Cas systems, pairing this data with one of the many guide-RNA libraries to find targets for manipulation with CRISPR tools is bringing back the advantages of BACs for genomics. With this in mind, the authors have developed a BAC restriction map database containing the restriction maps for both uniquely placed and insert-sequenced BACs from 11 libraries, covering the recognition sequences of available restriction enzymes, alongside a set of Python functions to reconstruct the database and access it more easily (which were debugged and had improved documentation added during review). The presented data should be valuable for researchers simply using BACs, as well as those working with larger sections of the genome in terms of synthetic genes, large-scale editing, and mapping.

      *This evaluation refers to version 1 of the preprint*

    2. Abstract

      While Bacterial Artificial Chromosomes were once a key resource for the genomic community, they have been obviated, for sequencing purposes, by long-read technologies. Such libraries may now serve as a valuable resource for manipulating and assembling large genomic constructs. To enhance accessibility and comparison, we have developed a BAC restriction map database.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.93), and has published the reviews under the same license. These are as follows.

      **Reviewer 1. Po-Hsiang Hung**

      Are all data available and do they match the descriptions in the paper?

      No. The dataset on the FTP server includes all the BAC sequences and the restriction enzyme recognition sites in csv files. However, I could not find the database of pairs of BACs which have overlaps generated by restriction enzymes that linearize the BACs. The makePairs function gave me an error when I tried running it locally, so I was not able to verify what is in these datasets. Personally, I find this function to be one of the most useful features described in this manuscript.

      Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide

      Yes. This manuscript contains the necessary minimal information (Submitting author, Author list, Dataset title, Dataset description, and Funding information)

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. The authors provide their code on GitHub so that researchers can download the datasets and analyze the sequences locally. However, I felt that the descriptions in the readme.md file are often insufficient to reproduce the data presented in the manuscript, especially for researchers with little to no programming experience. Detailed information should include examples of how to use each function, the input format, and the location of the output folder/files. I also encountered software version issues during the installation of bacmapping. Please re-test the code in a new environment and state the versions of all software used. For instance, I found that Python version 3.11 is incompatible with this package while Python version 3.7 is compatible.

      Is there sufficient data validation and statistical analyses of data quality?

      No. The authors used the Bio.Restriction package from Biopython to get the digestion site information. No extra validation is conducted in this manuscript. Due to the errors I encountered in re-running the code (see details in Any Additional Overall Comments to the Author), I suggest using an independent method to check several digestion sites in some BAC clones: either perform enzyme digestion on some BAC clones, or upload some BAC sequences to other software and compare the digestion sites.

      In the output files that contain the digestion sites for each enzyme, some of the enzyme digestion sites are either NA or []. What is the difference between the two? If they mean the same thing (no cutting by the enzyme), bugs or other coding errors may cause this inconsistency. Please check the code again and also verify some of them using the independent methods suggested above. Examples of this issue are the files in maps>sequenced>CEPHB. Here I list two enzymes that show different results in each file:

      - 3.csv: Ragl ([]), SchI (NA)
      - 6.csv: EspEI (NA), AccII ([])
      - 13.csv: EcoT22I ([]), Hsp92II (NA)
      - X.csv: PacI ([]), AcIWI (NA)
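A software-only version of the independent check suggested above could be as simple as a plain string search for an enzyme's recognition sequence, compared against the stored map. The sketch below is an assumption-laden illustration, not part of bacmapping: `find_sites` is a hypothetical helper, it handles only unambiguous recognition sequences on a linear sequence, and it reports where the recognition site starts rather than the exact cut position. It always returns a list, so "the enzyme does not cut" has a single representation, `[]`.

```python
def find_sites(sequence, recognition_site):
    """Return 0-based start positions of every match, overlapping included.

    An enzyme with no site in `sequence` gives the empty list, never NA/None.
    """
    if not recognition_site:
        raise ValueError("recognition_site must not be empty")
    sequence = sequence.upper()
    recognition_site = recognition_site.upper()
    positions = []
    start = sequence.find(recognition_site)
    while start != -1:
        positions.append(start)
        # Restart one base further on so overlapping sites are also found.
        start = sequence.find(recognition_site, start + 1)
    return positions


# EcoRI recognises the palindromic sequence GAATTC, so one strand suffices.
ECORI_SITE = "GAATTC"
```

Comparing such positions with a stored restriction map requires converting between conventions (site start versus cut position, 0-based versus 1-based) first; enzymes with degenerate or non-palindromic sites would need extra handling that this sketch omits.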

      Is the validation suitable for this type of data?

      No. No validation in this manuscript. See the answer above.

      Additional Comments: The authors have built a database of enzyme digestion site information for BAC clones to help people put the clones to further use. I think it is useful to have this information, and also to have the code for further local analysis. A very detailed user manual (or readme.md) is therefore very important to help people use this dataset. Below I summarize the issues I encountered in running the code, along with some suggestions.

      Major points:

      (1) I tested some bacmapping functions and discovered that some are not working as intended due to typos/bugs.
      - The versions of the required software need to be stated to help people install this package properly.
      - Refining the code and providing a better user manual would be very helpful for people without a lot of coding experience. The detailed information should include examples of how to use each function, the input format, and the location of the output folder/files. Descriptions for some functions in the readme file are not detailed enough and often do not describe what the input needs to be. For example, getCuts() requires 'row' as input, but the authors never give a detailed description of what 'row' is in the readme file; I had to look in bacmapping.py to understand it. If a function requires the variable 'row', show a few examples of how 'row' can be extracted from the proper input file.
      - mapPlacedClones() requires an input file ('/home/eamon/BACPlay/longboys.csv', line 335) that is located on the author's local computer and is not available through GitHub.
      - Typo in line 814 in getMap(). Should be: name = cloneLine['CloneName']
      - Inconsistency in output variable type in getMap() (lines 830 and 851). When local == 'sequenced', the output variable is a tuple, which causes issues in downstream functions such as getRestrictionMap() (line 869).

      (2) Add pairs of BACs to the dataset.

      (3) In the output files of digestion sites for each enzyme, some sites are shown as NA and others as [ ]. Please double-check this and explain the difference.

      (4) Validation of the digestion map by an independent method is suggested.

      Minor points: (1) Adding a title to each column of sequencedStats.csv would make the table easier to understand.

      Re-review:

      The authors have addressed the majority of my points. The software installation works well once the stated versions are used. The updated readme provides detailed information for each function and its required input variables, and the examples in the Jupyter notebook are a great help for running the code. I did, however, encounter two minor errors when I tested Ch19_bacmapping_example.ipynb on a Mac system. Please check this and update it.

      (1) The .DS_Store file that is automatically generated on a Mac system in the bacmapping/Examples/Ch19_example/maps/placed folder causes an error when running bmap.mapPlacedClones(cpustouse=cpus, chunk_size=chunksize). The same problem happened when I ran bmap.mapSequencedClones(cpustouse=cpus). After I deleted .DS_Store in the folder, the code worked.

      Here is the error message when I ran bmap.mapSequencedClones(cpustouse=cpus). NotADirectoryError: [Errno 20] Not a directory: '/Users/user_nsame/bacmapping/Examples/Ch19_example/maps/sequenced/.DS_Store'
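The NotADirectoryError above is what one would expect if every entry of the maps folder is treated as a library directory. A defensive listing step avoids it. The sketch below is an assumption about how such a guard could look; `library_directories` and `maps_root` are hypothetical names and this is not the code in bacmapping.py.

```python
from pathlib import Path


def library_directories(maps_root):
    """Return the sub-directories of `maps_root`, sorted by name.

    Plain files (such as the .DS_Store that macOS Finder creates) and hidden
    entries are skipped instead of being opened as library folders.
    """
    return sorted(
        entry
        for entry in Path(maps_root).iterdir()
        if entry.is_dir() and not entry.name.startswith(".")
    )
```

Filtering on `is_dir()` alone would already avoid the reported error; the extra check on the leading dot also ignores hidden directories, which is a design choice rather than a requirement.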

      (2) The second error is from running bmap.getRestrictionMap(name,enzyme). I got the error message 'list' object has no attribute 'item'. I was able to run this function after changing maps[nenzyme].item() to maps[nenzyme] in line 779 of bacmapping.py. I encountered the same error with the drawMap function. I was able to run this function after changing line 847 of bacmapping.py from rmap = maps[nenzyme].item() to rmap = maps[nenzyme].

      Here is the error message

      AttributeError                            Traceback (most recent call last)
      Cell In[20], line 5
            3 maps = bmap.getMaps(name)
            4 #print(maps) #this is a big dataframe of all the maps, uncomment to check it out
      ----> 5 rmap = bmap.getRestrictionMap(name,enzyme)
            6 print('Sites in ' + name + ' where ' + enzyme + ' cuts: '+ str(rmap))
            7 plt = bmap.drawMap(name, enzyme)

      File ~/miniconda3/envs/bacmapping/lib/python3.11/site-packages/bacmapping/bacmapping.py:779, in getRestrictionMap(name, enzyme)
          777 maps = getMaps(name)
          778 nenzyme, r = getRightIsoschizomer(enzyme)
      --> 779 return(maps[nenzyme].item())

      AttributeError: 'list' object has no attribute 'item'
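      The traceback suggests that `maps[nenzyme]` can be either a one-element container exposing `.item()` (as a pandas Series or NumPy array does) or an already unwrapped Python list, depending on the environment. A defensive accessor handles both cases. This is a minimal sketch under that assumption: `unwrap_single` and `OneCellColumn` are hypothetical names, and the stand-in class only mimics the `.item()` behaviour so that the example runs without pandas.

```python
def unwrap_single(value):
    """Return value.item() when the object offers it, otherwise the value itself.

    Hypothetical helper for illustration; not part of bacmapping.
    """
    item = getattr(value, "item", None)
    return item() if callable(item) else value


class OneCellColumn:
    """Stand-in for a one-row column whose .item() returns the stored cell."""

    def __init__(self, cell):
        self._cell = cell

    def item(self):
        return self._cell


if __name__ == "__main__":
    cut_sites = [1200, 45310, 98777]  # made-up positions for the demo
    print(unwrap_single(cut_sites))                  # plain list passes through
    print(unwrap_single(OneCellColumn(cut_sites)))   # wrapped cell is unwrapped
```

      Pinning the dependency versions would address the root cause; the accessor only keeps the function usable across both behaviours.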

      **Reviewer 2. Wei Dong **

      Is there sufficient data validation and statistical analyses of data quality? Not my area of expertise

      Is the validation suitable for this type of data? I am not sure about this. This is not my specialty.

      Overall comments: This is a great idea, fully exploring, integrating, and utilizing existing data for new research.

    1. **Editors Assessment: **

      This work presents a new standardized graphical approach for visualizing genetic associations across a wide range of allele frequencies. These proposed TrumpetPlots have a distinctive trumpet shape, hence the proposed name. With the majority of variants having low frequency and small effects, while a small number of variants have higher frequency and larger effects, this view can help to provide new and valuable insights into the genetic basis of traits and diseases, and also help prioritize efforts to discover new risk variants. The tool is provided as a novel R package and R Shiny application. To demonstrate its use, the article illustrates the distribution of variant effect sizes across the allele frequency range for over 100 continuous traits available in the UK Biobank. After some problems in testing, the package is now available and easy to deploy via CRAN.

      *This assessment refers to version 1 of this preprint. *

    2. Abstract: Recent advances in genome-wide association study (GWAS) and sequencing studies have shown that the genetic architecture of complex diseases and traits involves a combination of rare and common genetic variants, distributed throughout the genome. One way to better understand this architecture is to visualize genetic associations across a wide range of allele frequencies. However, there is currently no standardized or consistent graphical representation for effectively illustrating these results. Here we propose a standardized approach for visualizing the effect size of risk variants across the allele frequency spectrum. The proposed plots have a distinctive trumpet shape, with the majority of variants having low frequency and small effects, while a small number of variants have higher frequency and larger effects. These plots, which we call ‘trumpet plots’, can help to provide new and valuable insights into the genetic basis of traits and diseases, and can help prioritize efforts to discover new risk variants. To demonstrate the utility of trumpet plots in illustrating the relationship between the number of variants, their frequency, and the magnitude of their effects in shaping the genetic architecture of complex diseases and traits, we generated trumpet plots for more than one hundred traits in the UK Biobank. To facilitate their broader use, we have developed an R package ‘TrumpetPlots’ and R Shiny application, available at https://juditgg.shinyapps.io/shinytrumpets/, that allows users to explore these results and submit their own data.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.89) and has published the reviews under the same license. These are as follows.

      **Reviewer 1. Clara Albiñana **

      As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?

      No. Although there are no explicit guidelines for contribution in the manuscript or website, it is true that by placing the project on gitlab it is possible to contribute to the project / open issues.

      Is the code executable?

      No. Unfortunately, I wasn't able to install the R package. I have now opened an issue on the gitlab page so that it can hopefully get solved.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

      Yes. It is very common for new R packages to just use devtools for installation.

      Is the documentation provided clear and user friendly?

      Yes. The requirements for generating a trumpet plot just involve providing a set of GWAS summary statistics with specific column names, together with the GWAS sample size. This is very common for GWAS summary statistics-based tools. I think it is fine for the R package to require renaming the columns to fit the format, as one already needs to load the file into R. However, I find it inconvenient to have to re-save the summary statistics file with different column names for the Shiny app. Providing e.g. column indexes alone would be much more user-friendly.

      Is there enough clear information in the documentation to install, run and test this tool, including information on where to seek help if required?

      No. I cannot answer this question until I can install the tool.

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages?

      Not applicable. There are no existing comparable tools.

      Is automated testing used or are there manual steps described so that the functionality of the software can be verified?

      Yes. I can see there is a toy dataset included with the R package.

      Additional Comments:

      I think the manuscript is very clear and good at making the point of the utility of the software. The proposed trumpet plots are very visually appealing and can be useful to characterise the genetic variation of diverse phenotypes. The novelty of the trumpet plots, as compared to previously proposed effect size vs. allele frequency plots, is the use of positive and negative effect sizes, making it look like a trumpet. I also appreciate the style decisions in the standard generated plots, with a nice visually-appealing color scheme and design.

      On the use of the software, I have focused my testing on the R package, which I was not able to install. The shinyapp is very useful for visualising the existing, pre-computed trumpet plots, but I do not find it very useful for generating user-uploaded summary statistics for the reasons I mentioned above. Another comment on the ShinyApp is that I appreciate the possibility to download the plots but it would be very useful to include the name of the visualized phenotype as the plot title, for example, to avoid confusion when downloading multiple plots.

      I also found an incorrect sentence in the abstract, which I think should be reversed: "The proposed plots have a distinctive trumpet shape, with the majority of variants having low frequency and small effects, while a small number of variants have higher frequency and larger effects".

      **Reviewer 2. Wentian Li **

      Is the documentation provided clear and user friendly?

      No. Many aspects of Fig.1 are not explained.

      Overall Comments: A plot with allele frequency as the x axis and effect size (e.g. odds ratio) as the y axis is a very common display of the contribution of both common and rare alleles to genetic association. A schematic form of this plot is on almost everybody's presentation slides when introducing this topic (for an example, see Science (23 Nov 2012), vol 338(6110), pp.1016-1017). Considering how many people are already familiar with this type of plot, I feel that very little new is added in this paper: maybe only a new name ("trumpet"), and/or the power lines. The other methodological contributions (log-x, one variant per LD, avoiding gene-level statistics) are rather straightforward. People without experience with "shiny" (R package) can still use ggplot2 or plot in R to get the same result. Generally speaking, I think the paper is weak, though OK as a program/package announcement.

      Major comments: * I think the trumpet shape (increase of "effect size" for rare variants) is probably a direct consequence of using the odds ratio as the measure of effect size. If the allele frequency in the normal population is p0 and that in the disease population is p1, then [p1/(1-p1)]/[p0/(1-p0)] ~ p1/p0 tends to be large for small p0, simply because the denominator is small. On the other hand, if the population attributable risk (p0(RR-1)/(1+p0(RR-1))) is used as the y-axis, I am uncertain what the shape of the plot would be.

      * A risk allele carries these pieces of information:
      1. allele frequency,
      2. effect size (e.g. odds ratio),
      3. type-I error/p-value,
      4. type-II error/power.

      The plot in this paper shows #1 vs #2, with #4 added as an extra. In another publication proposing a way to plot genetic association results (Comp Biol. and Chem. (2014), 48:77-83 doi: 10.1016/j.compbiolchem.2013.02.003), #2 is plotted against #3, with #1 added as an extra. I'm sure other combinations could lead to other types of plots. The authors should discuss/compare these possibilities.
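      The two effect-size measures named in the first major comment can be compared numerically. The sketch below evaluates the odds ratio and the population attributable risk exactly as written in that comment. The frequencies and the fixed absolute frequency difference are made-up illustration values, not data from the paper, and using the odds ratio as a stand-in for RR is a rare-disease approximation adopted here only to keep the example short.

```python
def odds_ratio(p0, p1):
    """Odds ratio for allele frequency p0 in controls and p1 in cases."""
    return (p1 / (1 - p1)) / (p0 / (1 - p0))


def population_attributable_risk(p0, rr):
    """PAR = p0(RR-1) / (1 + p0(RR-1)), as written in the comment above."""
    return p0 * (rr - 1) / (1 + p0 * (rr - 1))


if __name__ == "__main__":
    delta = 0.005  # assumed, fixed case-control frequency difference
    for p0 in (0.001, 0.01, 0.1, 0.4):
        p1 = p0 + delta
        oratio = odds_ratio(p0, p1)
        # Assumption: RR is approximated by the odds ratio.
        par = population_attributable_risk(p0, oratio)
        print(f"p0={p0:<6} OR={oratio:7.3f}  p1/p0={p1 / p0:7.3f}  PAR={par:.4f}")
```

      Under these assumptions the odds ratio grows as p0 shrinks while the attributable risk stays small, which is the contrast the comment asks the authors to consider.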

      Minor comments: In Fig.1, the size of the dots, the brown vs cyan color, the discontinuity of scatter dots around 0.01, are not explained.

      Re-review:

      I have read the authors' response and I'm mostly satisfied. Only two minor comments:

      * The Witte 2014 Nature Rev. Genet. article summarizes well the point I tried to make. I understand that rare variants should have a relatively higher effect from an evolutionary perspective, but since these are rare, their individual or even collective contribution to a disease in the population is still small. A casual reader may not realize this point and I think it would be helpful to cite Witte's article.
      * My minor comment on Fig.1 is still not addressed: there seem to be more points on the right side of the p=0.01 line than on the left side. Why this discontinuity? (The added text in the revision is about the color and size of the dots, not about this discontinuity.)

    1. De novo

      Xupo Ding:

      1. The CDS and protein sequences could not be extracted from the masked.fasta file with the gff3 file when verifying the accuracy of the gene loci and related proteins. The extraction software was gffread in cufflinks 2.1.1. Please confirm the final assembly file that will be uploaded to GigaDB.
      2. Please confirm the accuracy of the gene prediction, especially for the Ks calculation.
      3. Before repeat masking with RepeatMasker, the final sequences were scanned with LTR_retriever and the LAI index was generated in this folder. The LAI values were 20.55 and 18.06, which would classify the haplogenome assemblies as reference or gold level. Please describe the LAI values after the BUSCO completeness in the revised manuscript.
      4. The percentages of the two largest LTR subfamilies, Gypsy and Copia, were not presented in Supplementary Table S5.
      5. Two Eucalyptus genomes have been published (Nature 2014; GigaScience 2020), and neither analysed LTR insertion times in detail. The insertion times of all TEs, Gypsy and Copia would strengthen this manuscript, especially as the basic data are already present as *.list files in the LTR_harvest and LTR_retriever scans.
      6. Were the genes specific to each haplogenome classified? Which pathways or GO terms are they enriched in?
      7. Some SVs may be associated with plant traits. The genes located in the regions of the different SV types should be further identified and tested for GO and KEGG enrichment.
      8. "Syntenic gene pairs between the E. grandis and E. urophylla haplogenomes were identified using a python version of MCScan, JCVI v1.1.18." The syntenic gene pairs in Figure 4 seem to come only from JCVI, not from MCScan.
      9. Reference citations should be consistent; for example, Candotti et al. in the Genome scaffolding section should be revised.
      10. The language should be improved and revised by an academic editor.

    2. Summary

      Chao Bian: This study, entitled "Haplogenome assembly reveals interspecific structural variation in Eucalyptus hybrids", reports two haplotypes from Eucalyptus grandis and E. urophylla. Both genomes are of high quality and high completeness. Nevertheless, why not sequence Eucalyptus grandis and E. urophylla directly and separately, and assemble each genome on its own? That way, the authors would not need so many assembly steps to distinguish the haplogenomes.

      On the other hand, the authors have written a long paragraph on the SVs and SNPs between the two Eucalyptus species. However, they only show the numbers of SVs and SNPs, and do not show any relationship between the SVs and biological characters. Could some SVs and SNPs that involve or affect certain genes explain some biological differences between Eucalyptus grandis and E. urophylla? In my view, showing only the numbers of SVs and SNPs does little for the wider interest of this study. Some biological stories should be reported in a genome study.

      Please provide new figures with higher resolution; the current figures are too unclear.

      Please use the newer BUSCO v5.2.2 and indicate the library used.

      What is the QUAST assessment result in this study?

      The English of this paper needs substantial polishing. There are too many spelling and other mistakes in the manuscript.

      Some minor suggestions:

      - The decimal places should be uniform, such as "(567 Mb and 545 Mb) to 97.9% BUSCO completion" and "scaffold N50 of 43.82 Mb and 42.45 Mb for the E. grandis and E. urophylla haplogenomes, respectively".
      - In 'All scripts used in this study is available on github.', 'is' should be 'are'.
      - The language of this sentence should be revised: "Illumina short-reads were used for k-mer based genome size estimation was performed using Jellyfish v2.2.6 (Jellyfish, RRID:SCR_005491) [25] for 21- mers and visualised with GenomeScope v2.0"
      - For the scaffolding step, why did the authors remove all contigs smaller than 3 kb?
      - 'The predicted gene space was' should be 'The predicted gene spaces were'.
      - For "a contig N50 of 3.91 Mb 1." and 'was greater than 88.0% 2', what is the meaning of the trailing '1' and '2' in these sentences?
      - In the sentence 'Approximately 3.3 μg of HMW DNA from was used without', 'from' what?
      - "a BUSCO completeness score of at least 95.3% was obtained for contigs anchored to one of the eleven chromosomes." For one of the eleven chromosomes? Why were contigs anchored to only one chromosome?
      - Revise 'markers each.,'.
      - "BUSCO completeness scores of 94.6% and 95.8% was obtained": 'was' should be 'were'.
      - "Although there is a greater number of local variants compared to SVs": 'there is' should be 'there are'.
      - "respectively, Supplementary Table S3)" should be revised to 'respectively, (Supplementary Table S3)'.
      - 'Mbp' should be revised to 'Mb'.
      - 'assemblies was' should be 'assemblies were'.

    1. Background

      Ilan Gronau: This manuscript describes updates made to GADMA, which was published two years ago. GADMA uses likelihood-based demography inference methods as likelihood-computation engines, and replaces their generic optimization technique with a more sophisticated technique based on a genetic algorithm. The version of GADMA described in this manuscript has several important added features. It supports two additional inference engines, more flexible models, additional input and output formats, and it provides better values for the hyper-parameters used by the genetic algorithm. This is indeed a substantial improvement over the original version of GADMA. The manuscript clearly describes the different features added to GADMA, and then demonstrates them with a series of analyses. These analyses establish three main things: (1) they show that the new hyper-parameters improve performance; (2) they show how GADMA can be used to compare the performance of different approaches to calculating data likelihood for demography inference; (3) they showcase new features of GADMA (supporting model structure and inbreeding inference). Overall, the presentation is very clear and the results are interesting and compelling. Thus, despite being a publication about a method update, it shows substantial improvement, provides interesting new insights, and will likely lead to expansion of the user base for GADMA.

      The only major comment I have is about the part of the study that optimizes the hyperparameters. The hyper-parameter optimization is a very important improvement in GADMA2. The setup for this analysis is very good, with three inference engines, four data sets used for training and six diverse data sets used for testing. However, because of complications with SMAC for discrete hyperparameters, the analysis ends up considering six separate attempts. The comparison between the hyper-parameters produced by these six attempts is mostly done manually across data sets and inference engines.
This somewhat defeats the purpose of the well-designed setup. In the end, it is very difficult for the reader to assess the expected improvement of the final suggested hyperparameter values (attempt 2) over the default ones. I have two comments/suggestions about this part.

      First, I'm wondering if there is a formal way to compare the eventual parameters of the six attempts across the four training sets. I can see why you would need to run SMAC six separate times to deal with the discrete parameters. However, why do you not use the SMAC score to compare the final settings produced by these six runs?

      Second, as a reader, I would like to see a single table/figure summarizing the improvement you get using whatever hyper-parameters you end up suggesting, compared to the default setting used in GADMA1. This should cover all the inference engines and all the data sets in one coherent table/figure. Using such a table/figure, you could report improvement statistics, such as the average increase in log-likelihood, or the average decrease in convergence times. These important results get lost in the many improved figures and tables.

      These are my main suggestions for revisions of the current version. I also have some more minor comments that the authors may wish to consider in their revised version, which I list below.

      Introduction:

      para 2: the survey of demography inference methods focuses on likelihood-based methods, but there is a substantial family of Bayesian inference methods, such as MPP, Ima, and G-PhoCS. Bayesian methods solve the parameter estimation problem by Bayesian sampling. I admit that this is somewhat tangential to what GADMA is doing, but this distinction between likelihood-based methods and Bayesian methods probably deserves a brief mention in the introduction.

      para 2,3: you mention a result from the original GADMA paper showing that GADMA improves on the optimization methods implemented by current demography inference methods.
Readers of this paper might benefit from a brief summary of the improvement you were able to achieve using the original version of GADMA. Can you add 2-3 sentences providing the highlights of the improvement you were able to show in the first paper?

      para 3: The statement "GADMA separates two regular components" is not very clear. Can you rephrase to clarify?

      Materials and methods - Hyper-parameter optimization:

      I didn't fully understand what you use for the cost function in SMAC here. It seems to me that there are two criteria: accuracy and speed. You wish the final model to be as accurate as possible (high log likelihood), but you want to obtain this result with few optimization iterations. Can you briefly describe how these two objectives are addressed in your use of SMAC? It's also not completely clear how results from different engines and different data sets are incorporated into the SMAC cost. Can you provide more details about this in the supplement?

      para 2: "That eliminate three combinations" should be "This eliminates three combinations".

      para 3: "Each attempt is running" should be "Each attempt ran".

      para 3: "We take 200×number of parameters as the stop criteria". Can you clarify? Does this mean that you set the number of GADMA iterations to 200 times the number of demographic model parameters? Why should it be a linear function of the number of parameters? The following text explains the justification, but

      Table 1: I would merge Table S2 with this one (by adding the ranges of all hyper-parameters as a first column).
It's important to see the ranges when examining the different selections.

      Materials and methods - Performance test of GADMA2 engines:

      para 2: "ROS-STRUCT-NOMIG" should be "DROS-STRUCT-NOMIG". Also, "This notation could be read" - maybe replace by "This notation means" to signal that you're explaining the structure notation.

      para 4 (describing comparisons for momi on Orangutan data): "ORAN-NOMIG model is compared with three …". You also consider ORAN-STRUCT-NOMIG in the momi analysis, right?

      Results - Performance test of GADMA2 engines:

      Inference for the Drosophila data set under model with migration: you mention that the models with migration obtain lower likelihoods than the models without migration. You cannot directly compare likelihoods in these two models, since the likelihood surface is not identical. So, I'm not sure that the fact that you get higher likelihoods in the models without migration is a clear enough indication of model fit. The fact that the inferred migration rates are low is a good indication of that. It also seems that, despite converging to models with very low migration rates, the other parameters are inferred with higher noise. For example, the size of the European bottleneck is significantly increased in these inferences compared to that of the NOMIG. So, potentially the problem here is that more time is required for these complex models to converge.

      Inference for the Drosophila data set under structured model (2,1): the values inferred by moments and momentsLD appear to neatly fit the true values. However, it is not straightforward to compare an exponential increase in population size to an instantaneous increase. Maybe this can be done by some time-averaged population size, or the average time until coalescence in the two models?
This will allow you to quantify how well the two exponential models fit the true model with instantaneous increase.

      Inference for the Orangutan data set under structured model (2,1) without migration: you argue that a constant population size is inferred for Bor by moments and momi because of the restriction on population sizes after the split. You base this claim on a comparison between the log-likelihoods obtained in this model (STRUCT-NOMIG) and the standard model (NOMIG) in which you add this restriction. I didn't fully understand how you can conclude from this comparison that the constant size inferred for Bor is due to the restriction on the initial population size after the split. I think what you need to do to establish this is run the STRUCT model without this restriction and see that you get exponential decrease. Can you elaborate more on your rationale? A detailed explanation should appear in the supplement and a brief summary in the main text.

      Inference for the Orangutan data set with models with pulse migration: This is a nice result showing that the more pulses you include, the better the estimates become. However, your main example in the main text uses the inferred migration rates. This is a poor example, because migration rates in a pulse model cannot be compared to rates in a continuous model. If migration is spread along a longer time range, then you expect the rates to decrease. So, there is no expectation of getting the same rates. You do expect, however, to get other parameters reasonably accurate. It seems like this is done with 7 pulses, but not so much with one pulse.
This should be the main focus of the discussion of these results.

      Results - inference of inbreeding coefficients:

      When you describe the results you obtained for the cabbage data set, you say "the population size for the most recent epoch in our results is underestimated (6 vs 592 individuals) for model 1 without inbreeding and overestimated (174,960,000 vs. 215,000 individuals) for model 2 with inbreeding". The usage of under/overestimated is not ideal here, because it would imply that the original dadi estimates are more correct. You should probably simply say that they are lower/higher than the estimates originally obtained by dadi. Or maybe even suggest that the original estimates were over/underestimated?

      Supplementary materials:

      Page 4, para 2: "Figure ??" should be "Figure S1".

      Page 4, para 4: Can you clarify what you mean by "unsupervised demographic history with structure (2, 1)"?

      Page 22, para 2: "Compared to dadi and moments engines momentsLD provide slightly worse approximations for migration rates". I don't really see this in Supplementary Table S16. Estimates seem to be very similar in all methods. Am I missing anything? You make the same statement again in the STRUCT-MIG model (page 23).

      Page 22, para 4: "The best history for the ORAN-NOMIG model with restriction on population sizes is -175,106 compared to 174,309 obtained for the ORAN-STRUCT-NOMIG mod". There is a missing minus sign before the second log likelihood. You should also specify that this refers to the moments engine. Also see the comment above about this result.

    2. Abstract

      Ryan Gutenkunst: In this paper, the authors present GADMA 2, an update of their population genomic inference software GADMA. The authors' software serves as a driver for other population genomics software, enabling a consistent user interface and a different parameter optimization approach. GADMA 2 extends GADMA by adding two new inference engines (momi2 and momentsLD), hyperparameter optimization for the genetic algorithm, demes visualization, selection, dominance, and inbreeding modeling, and a new method for specifying model structures. In this paper, the authors show that their optimized genetic algorithm is somewhat more effective than the original hyperparameter settings. They also compare among inference engines, finding some differences in behavior. Lastly they compare with dadi itself in two models with inbreeding, finding parameter sets with better likelihoods than those previously published.

      GADMA has already found some use in the population genomics community, and GADMA 2 is a substantial update. The manuscript describes the updates in good detail and demonstrates the effectiveness of GADMA 2 on two real-world data sets. Overall, this is a strong contribution, and we have few major concerns.

      Major Technical Concerns:

      1) The authors claim to now support inference of selection and dominance. But what they support is very limited and not very biological. In particular, they currently support inferences which assume a single selection and dominance coefficient for the entire data set (as in Williamson et al. (2005) PNAS). In reality, any AFS will include sites with a variety of selection coefficients, usually summarized by a distribution of fitness effects. Since Keightley and Eyre-Walker (2007) Genetics, this has been the standard for inferring selection from the AFS.
The authors should be clear about the limitations of what they have implemented.

      2) Figure 4 shows that optimization runs using GADMA 2 tend to find better likelihoods than bare dadi optimization runs. But the advice for using dadi or moments is to run multiple optimizations and take the best likelihood found, with some heuristic for assessing convergence. So most users would not (or at least should not) stop with the result of a single dadi optimization run. It does seem that GADMA 2 reduces the complexity of assessing convergence between multiple dadi optimization runs. But another important consideration is computational cost. (At an extreme, if each dadi run was 100 times faster than a single GADMA 2 run, then the correct comparison would be between the best of 100 dadi runs and a single GADMA 2 run.) It is not clear from the paper how the 100 GADMA 2 runs compare to the 100 dadi runs in terms of computational cost. It would be very helpful to have a table or some text describing the average computational cost (in CPU hours) of those runs.

      Major Writing / Presentation Concerns:

      1) Bottom of page 5: The authors are sharing the results of their hyperparameter optimizations from their own server, with uncertain lifetime. These results should be moved to an archival service such as Dryad.

      Minor Technical Concerns:

      1) The authors note that the DROS-MIG models had worse likelihoods than the DROS-NOMIG models. Since these are nested models, the DROS-MIG model must mathematically have a better global optimum likelihood. It would be worth pointing out that the likelihoods they found indicate a failure of the optimization algorithms. The authors should also present the DROS-MIG model results in a supplementary table.

      2) The Godambe parameter uncertainties in Tables S20 and S21 are pretty extreme, sometimes 10^-13 to 10^12. This may be due to instability of the Godambe approximation versus step size. In Blischak et al.
(2020) Mol Biol Evol, the authors tried several step sizes and sought consistent results between them (Tables S1-S4). We suggest the authors take that approach here.

      Minor Writing / Presentation Concerns:

      1) The authors claim that "GADMA does not require model specification". However, it seems that the GADMA "structure model" describes a different and perhaps broader way to specify demographic models, rather than completely eliminating model specification.

      2) The authors use the term "inference engine" for the four tools GADMA 2 builds upon. But to us, the act of inference includes parameter optimization. In this case, these tools are not being used for the inference itself, but rather to calculate the (composite) likelihood of the data. Perhaps a better term would be "likelihood calculator"?

      3) The authors suggest engine-specific hyperparameter optimization as a future goal. But the optimal hyperparameters are also likely to be model specific. (For example, 2- versus 4-population models might benefit from different optimization regimes.) Can the authors comment on this?

      Writing Nitpicks:

      1) Abstract: "optimization algorithms used to find model parameters sometimes turn out to be inefficient" → vague: more details on why/how they are inefficient would be helpful.

      2) Introduction: "Inference of complex demographic histories… in the population's past." needs a citation.

      3) Page 2: "parameter to infer, for example, all migration" is a comma splice and should be split into two sentences.

      4) Supplement page 4: Figure ?? reference is broken.
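      The equal-cost comparison raised in Major Technical Concern 2 can be sketched as follows: if k runs of the faster optimizer fit in the CPU time of one run of the slower one, compare the expected best of k fast runs with a single slow run. This is only an illustration of that reasoning. The function names are hypothetical, the log-likelihood values are synthetic placeholders rather than results from the paper, and neither dadi nor GADMA is called.

```python
import random


def expected_best(loglikes, k, trials=2000, seed=0):
    """Estimate the expected best log-likelihood among k runs resampled from loglikes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.choice(loglikes) for _ in range(k))
    return total / trials


def compare_at_equal_cost(fast_runs, slow_runs, cost_ratio):
    """Compare the best of `cost_ratio` fast runs with one slow run.

    cost_ratio: assumed number of fast runs that fit in the CPU time of one slow run.
    """
    return expected_best(fast_runs, cost_ratio), expected_best(slow_runs, 1)


if __name__ == "__main__":
    # Synthetic placeholder log-likelihoods, for illustration only.
    fast_runs = [-1050.0, -1020.0, -1004.0, -1090.0, -1001.0]
    slow_runs = [-1003.0, -1002.0, -1001.5, -1002.5]
    fast, slow = compare_at_equal_cost(fast_runs, slow_runs, cost_ratio=10)
    print(f"best of 10 fast runs: {fast:.2f}   single slow run: {slow:.2f}")
```

      In practice the cost ratio would come from the measured CPU hours that the reviewer asks the authors to report.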

    1. Background

      Michel Dumontier: This paper describes KGML-xDTD, a knowledge graph-based ML framework to predict and explain potential applications of drugs. The main approach is the use of graph reinforcement learning to predict drug-disease pairs and provide a knowledge-based path as a potential mechanism of action. The method is evaluated against other approaches, various data partitioning strategies, comparison to a manually curated database of mechanisms of action, and two use cases. The paper is well written, easy to read, and makes a contribution to the scientific literature. Accurate prediction of drug uses remains an important and challenging problem in biomedical informatics. The novelty of the approach is to use graph reinforcement learning to achieve state-of-the-art performance for the problem, and it is also able to generate plausible paths within a knowledge graph to serve as mechanistic explanations. There are some limitations to the work that should be addressed. These include: 1) The baseline models (GAT & GraphSAGE+SVM) only use a small subset of drug-disease replacements. The authors indicate that the smaller subset is necessary owing to time performance constraints. However, there is no discussion of the possible impact of the reduced subset on any aspect of the comparison with their method. 2) The approach only evaluates 3-hop KG paths, which is 1/7 of what is available in DrugMechDB. What is the quality/performance impact of choosing longer paths? Wouldn't the number of biologically reasonable paths to explain a prediction be substantially reduced? I worry that this is cherry-picking the dataset to show good performance for the only case (3-hop) that it is capable of (while criticizing other methods as not being performant). 3) The authors use RepoDB as one of their sources, and specifically use the "withdrawn" set as true negatives. However, most withdrawn tags are linked to reasons other than safety or efficacy of the clinical trial.
As such it is not clear that this set is a good true negative set. 4) The authors use MyChem as a resource for drug indications/contraindications. However, MyChem is not an original source - it aggregates other resources. The authors should properly identify the source of "human curated annotations". 5) I commend the authors for their evaluation, which uses a number of different train/test strategies and against different methods. However, as far as i can see the train/test strategy does not adequately remove similar true drugs-disease pairs from the training/test set. That is to say there are many drugs that are approved for very similar conditions, and therefore it becomes somewhat trivial to predict these (this problem is highlighted in the 2011 PREDICT paper by assaf gottlieb). More work should be done here to report an accuracy based on more stringent evaluation criteria. 6) It's unclear to me that the 124k diseases are real (diagnosable) diseases that could be prescribed for. Inflating the number of possible (but implausible) diseases might augment the performance, but contribute nothing to medicine. Elaborate. 7) Figures 5, 6 are difficult to read 8) It's nice to see the 2 use cases in the paper. However, the extracted subgraphs are quite different than the DrugMechDB MOA paths. So there's something to be said about the succinctness of the DrugMechDB MOA paths, which might prove to be a better training set for some explanation algorithm, rather that one that is independently generated. Overall, this is a nice paper with an interesting approach.
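
      One way to make the evaluation raised in point 5 more stringent is to split by groups of related drug-disease pairs rather than by individual pairs, so that near-duplicate indications never sit on both sides of the train/test boundary. The sketch below only illustrates that idea and is not the authors' or the reviewer's procedure: the function `grouped_split`, the placeholder drug and disease names, and the `disease_family` mapping are all hypothetical, and how "similar" conditions should be grouped is left as an assumption.

```python
import random
from collections import defaultdict


def grouped_split(pairs, group_of, test_fraction=0.2, seed=0):
    """Split pairs so that every group lands entirely in train or entirely in test."""
    groups = defaultdict(list)
    for pair in pairs:
        groups[group_of(pair)].append(pair)

    # Shuffle the group keys (not the individual pairs) with a fixed seed.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # Hold out at least one whole group for testing.
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [p for k in keys if k not in test_keys for p in groups[k]]
    test = [p for k in keys if k in test_keys for p in groups[k]]
    return train, test


# Hypothetical placeholder data: diseaseA1 and diseaseA2 are "very similar conditions".
disease_family = {
    "diseaseA1": "familyA",
    "diseaseA2": "familyA",
    "diseaseB1": "familyB",
    "diseaseC1": "familyC",
}
pairs = [
    ("drugX", "diseaseA1"),
    ("drugX", "diseaseA2"),
    ("drugY", "diseaseB1"),
    ("drugZ", "diseaseC1"),
    ("drugY", "diseaseA1"),
]


def group_of(pair):
    # Assumed grouping rule: same drug and same disease family belong together.
    drug, disease = pair
    return (drug, disease_family[disease])


train, test = grouped_split(pairs, group_of, test_fraction=0.25)
```

      With this grouping, the two `drugX` pairs for `familyA` always end up on the same side of the split, so a model cannot be credited for predicting one from the other.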

    2. ABSTRACT

      Yuansheng Liu: The paper entitled "KGML-xDTD: A Knowledge Graph-based Machine Learning Framework for Drug Treatment Prediction and Mechanism Description" proposes KGML-xDTD, a two-module, knowledge graph-based machine learning framework. The authors construct a large knowledge graph for training the model. The model is divided into two modules, one for drug repurposing prediction and the other for mechanism of action prediction. Both modules achieve good results compared with the existing baselines. Here are my specific points:

      (1) It is mentioned on page 6 that the data are classified into three categories, while other data are classified into two categories. How did you exclude the "unknown" category and adjust the results?

      (2) The Drug Repurposing Prediction model and the Mechanism of Action Prediction model seem to be two separately trained models. I cannot find evidence of multitask training in the content. If the models are trained separately, which model do the evaluation metrics refer to? If they are trained together, the model section should be written more clearly.

      (3) The introduction only mentions existing Drug Repurposing Prediction models; it does not describe existing Mechanism of Action Prediction models.

      (4) The baselines seem to be SOTA Drug Repurposing Prediction models, but the best performance of the work is on Mechanism of Action Prediction.

      (5) The data set appears to selectively choose drug-disease pairs with intermediate paths. If the drug and the disease are not connected in the network, how does the Drug Repurposing Prediction model perform?

  12. Aug 2023
    1. Abstract: Recent advances in bioinformatics and high-throughput sequencing have enabled the large-scale recovery of genomes from metagenomes. This has the potential to bring important insights as researchers can bypass cultivation and analyse genomes sourced directly from environmental samples. There are, however, technical challenges associated with this process, most notably the complexity of computational workflows required to process metagenomic data, which include dozens of bioinformatics software tools, each with their own set of customisable parameters that affect the final output of the workflow. At the core of these workflows are the processes of assembly - combining the short input reads into longer, contiguous fragments (contigs), and binning - clustering these contigs into individual genome bins. Both processes can be done for each sample separately or by pooling together multiple samples to leverage information from a combination of samples. Here we present Metaphor, a fully-automated workflow for genome-resolved metagenomics (GRM). Metaphor differs from existing GRM workflows by offering flexible approaches for the assembly and binning of the input data, and by combining multiple binning algorithms with a bin refinement step to achieve high quality genome bins. Moreover, Metaphor generates reports to evaluate the performance of the workflow. We showcase the functionality of Metaphor on different synthetic datasets, and the impact of available assembly and binning strategies on the final results. The workflow is freely available at https://github.com/vinisalazar/metaphor.

      **Reviewer 2. Po-Yu Liu**

      Metaphor is a workflow with high completeness for short-read-based metagenomic analysis. I look forward to its compatibility with long-read platforms (ONT and PacBio). This work is worth publishing. However, it is still a toolkit that requires bioinformatic knowledge and skills. If Metaphor could be integrated into a web-based platform, such as Galaxy or KBase, it would be more user-friendly for many more users.

    2. Abstract: Recent advances in bioinformatics and high-throughput sequencing have enabled the large-scale recovery of genomes from metagenomes. This has the potential to bring important insights as researchers can bypass cultivation and analyse genomes sourced directly from environmental samples. There are, however, technical challenges associated with this process, most notably the complexity of computational workflows required to process metagenomic data, which include dozens of bioinformatics software tools, each with their own set of customisable parameters that affect the final output of the workflow. At the core of these workflows are the processes of assembly - combining the short input reads into longer, contiguous fragments (contigs), and binning - clustering these contigs into individual genome bins. Both processes can be done for each sample separately or by pooling together multiple samples to leverage information from a combination of samples. Here we present Metaphor, a fully-automated workflow for genome-resolved metagenomics (GRM). Metaphor differs from existing GRM workflows by offering flexible approaches for the assembly and binning of the input data, and by combining multiple binning algorithms with a bin refinement step to achieve high quality genome bins. Moreover, Metaphor generates reports to evaluate the performance of the workflow. We showcase the functionality of Metaphor on different synthetic datasets, and the impact of available assembly and binning strategies on the final results. The workflow is freely available at https://github.com/vinisalazar/metaphor.

      This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad055), and the reviews are published under the same license. These are as follows.

      **Reviewer 1. Thomas Brüls**

      The authors present a Snakemake-based workflow to automate and chain the main computational ingredients (assembly and binning) of genome-centric metagenomics. The authors developed a technically sound tool for this purpose, and by itself it is certainly valuable to the research community and worthy of publication. However, even if the article is cast as a technical note (hence with an emphasis on the design, implementation and assessment of the tool), I feel that a more thorough discussion of both its abilities and inabilities (e.g. strain resolution, detection of low abundance organisms, identification of virus bins, etc.) would be worthwhile for a more general audience. By the same token, a deeper discussion of some of the results obtained with their tool (see below) would be of interest and would also illustrate useful use cases. I would suggest the following modifications/additions:

      - The experiments with the strain madness dataset suggest that the genomes (or fragments thereof, i.e. the bins) resolved should be viewed as "species" genomes, or composite genomes possibly originating from multiple strains. If so, do the authors think this represents a hard limit to the assembly + binning approach, or could further existing tools (e.g. performing variant detection on top of cross-assembly before the binning step) be integrated or developed in the future for strain resolution (i.e. to identify strains not dominant in any sample)?

      - Related, a simple summary of the number of individual strains recovered in individual bins for the strain madness experiment would be interesting.

      - Another issue that would be worth discussing in my opinion is the impact of genome abundance on the recovery of corresponding bins and their quality. The platform developed by the authors appears to be well suited for this kind of analysis and the results would be of both theoretical and practical interest. To put it simply, what is the minimal initial coverage of genomes required in order for them to be recovered in bins of a given size and quality?

      - Remark: these two issues (strain-level diversity and individual strain genome abundances) likely interact to limit bin resolution, and this could be mentioned by the authors.

      - The data presented by the authors suggest that the MetaBAT binning engine significantly outperforms the other two tools (CONCOCT and VAMB, which are both widely used), see e.g. Figure 2; what would account for that, and do the authors think this is a general observation (i.e. beyond the specific CACB setting or marine metagenome shown in Fig 2)?

      - A bin refinement step (based on the DAS tool and dereplication) is frequently mentioned but should be described in more detail (including a precise definition of the bin quality metric used).

      Further, rather minor comments:

      - In the abstract, when mentioning "technical challenges associated with...", it would be worth mentioning that algorithmic challenges are present as well.

      - In the introduction: "It is hypothesised that pooled assembly and binning may lead to improved results when analysing communities with high genetic diversity, and to poorer results when there is a high level of intraspecies/strain-level diversity". I would assume there are many instances in the real world that are both, i.e. that present both high inter-species and intra-species genetic diversity; what then?

      - In the future directions, the authors mention the identification of eukaryotic and viral contigs and bins, and could briefly elaborate on how this could be done properly.

      - The sentence "In summary, our assessment of ..." at the end of the manuscript appears to have a syntactic problem.

    1. Abstract: Hetnets, short for “heterogeneous networks”, contain multiple node and relationship types and offer a way to encode biomedical knowledge. One such example, Hetionet, connects 11 types of nodes (including genes, diseases, drugs, pathways, and anatomical structures) with over 2 million edges of 24 types. Previous work has demonstrated that supervised machine learning methods applied to such networks can identify drug repurposing opportunities. However, a training set of known relationships does not exist for many types of node pairs, even when it would be useful to examine how nodes of those types are meaningfully connected. For example, users may be curious not only how metformin is related to breast cancer, but also how the GJA1 gene might be involved in insomnia. We developed a new procedure, termed hetnet connectivity search, that proposes important paths between any two nodes without requiring a supervised gold standard. The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). We find that predictions are broadly similar to those from previously described supervised approaches for certain node type pairs. Scoring of individual paths is based on the most specific paths of a given type. Several optimizations were required to precompute significant instances of node connectivity at the scale of large knowledge graphs. We implemented the method on Hetionet and provide an online interface at https://het.io/search. We provide an open source implementation of these methods in our new Python package named hetmatpy.

      **Reviewer 2. Paolo Provero**

      In this work Himmelstein and collaborators introduce a statistically controlled way of extracting significant node pairs in heterogeneous networks (hetnets) without relying on a ground truth and related training. The method "explains" why two nodes are significantly connected by extracting the metapaths most responsible for the enrichment. This is based on computing a null distribution of the DWPC, which allows assigning a P-value to each metapath joining two nodes, and then visualizing the individual paths responsible for the enrichment. The method is novel and significant, and can in principle be applied to many hetnets, in life sciences and beyond, when a ground truth is not available or not desirable as it would introduce bias. The software tools developed appear to be readily available to other researchers.

      Major comment: If I understand correctly, given two nodes (say "Alzheimer disease" and "Circadian rhythm") the method extracts, in a statistically controlled way, the most significant metapaths joining the two nodes, and then the individual paths responsible for the enrichment. But this is not the most obvious question a life scientist would ask the network, which would instead be something like "Which are the pathways most significantly connected to 'Alzheimer disease'?" Indeed this type of question would be the one to ask when aiming for drug repurposing (possibly replacing "pathways" with "compounds" or "pharmacologic classes"). Based on Fig. 4A, the pathways are presented, or "suggested," in decreasing order of number of metapaths, but this is hardly a ranking by significance. Would it be possible to summarize the results in such a way as to rank the pathway nodes connected to a given disease node by significance (or more generally to rank the nodes of a certain type by the significance of their connection to a given node of another type)? This should be discussed.

      I also have several minor concerns.

      (1) The authors introduce and compute a null distribution of the DWPC which takes into account node degree in a statistically controlled way when evaluating the connectivity between two nodes. However, the DWPC itself does take into account node degree, as the name implies, and contains a tunable parameter that can be optimized, at least when a ground truth is available (as in Ref 39 by the same first author). I understand such tuning is not possible when, as in the present case, no ground truth is available, but the authors should make this point more clearly.

      (2) I find Fig. 1B a bit confusing: according to the legend, the top rows are known treatments, which should have higher than expected connectivity. However, based on the colors as explained by the legend, the bottom treatment/disease pairs seem to have higher connectivity.

      (3) The acronym DWPC is defined after it has been used several times.

      (4) The legend of Figure 2 should specify that these results apply to the nodes "Alzheimer disease" and "Circadian rhythm", although this becomes clear in Fig. 4.

      (5) I don't think Figure 3, representing the home page of the web site, is especially useful.

      (6) I found Fig. 4 confusing: the sum of the path counts for the selected metapaths in panel B is way larger than the 425 results shown in Panel C. As far as I understand, no path can belong to more than one metapath, so is there some further selection here?

      (7) The "Frontend" section of the Methods seems a bit too detailed for the GigaScience audience.

      Re-review: The authors have addressed all my comments in a satisfactory way.

    2. Abstract: Hetnets, short for “heterogeneous networks”, contain multiple node and relationship types and offer a way to encode biomedical knowledge. One such example, Hetionet, connects 11 types of nodes (including genes, diseases, drugs, pathways, and anatomical structures) with over 2 million edges of 24 types. Previous work has demonstrated that supervised machine learning methods applied to such networks can identify drug repurposing opportunities. However, a training set of known relationships does not exist for many types of node pairs, even when it would be useful to examine how nodes of those types are meaningfully connected. For example, users may be curious not only how metformin is related to breast cancer, but also how the GJA1 gene might be involved in insomnia. We developed a new procedure, termed hetnet connectivity search, that proposes important paths between any two nodes without requiring a supervised gold standard. The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). We find that predictions are broadly similar to those from previously described supervised approaches for certain node type pairs. Scoring of individual paths is based on the most specific paths of a given type. Several optimizations were required to precompute significant instances of node connectivity at the scale of large knowledge graphs. We implemented the method on Hetionet and provide an online interface at https://het.io/search. We provide an open source

      This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad047), and the reviews are published under the same license. These are as follows.

      **Reviewer 1. Karthik Raman**

      The paper is very well-written and addresses an important problem. The database appears easy to use and contains a lot of pre-computed data, which will be useful for researchers to query and generate useful insights. I only have a few minor comments, which if addressed, could further strengthen this manuscript.

      Minor comments: Without line and page numbers, it was a bit tricky to point out the issues.

      1. "One such application" in the introduction does not read well - just "one application".

      2. It is nice to see that DWPCs that are not retained by the database can be generated on the fly. The para goes on to mention "while still allowing on-demand access to the full metrics for all metapaths with length ≤ 3" --- is it also possible to generate metrics for longer paths if needed?

      3. Below Fig 2, there is a point about the adjusted p-value. I see that the discussion about FDR is presented later in the manuscript (and well justified), but there could be a pointer here to that section.

      4. Is there a possibility to include other computations like betweenness centrality and motifs also? This kind of data looks really ripe for an excellent analysis of repeated motifs etc.

      5. I found the Methods extremely long and maybe a bit distracting for readers of this manuscript --- I was wondering if some of these can be moved to Supplementary.

      6. In the section on "Details of matrix DWPC implementation", it is stated that "our matrix methods were validated". It is not clear where these validations have been discussed. Supplementary?

      7. In the section on "Permuted hetnets", it is not fully clear what the parameters for the XSwap algorithm were. What were the parameters, e.g. number of swaps, etc.?

      8. In the section on "Details of the gamma-hurdle distribution", there is perhaps a missing equation below the second statement of "The probability of a draw from the distribution is".

      9. The validation here, which points to an ipynb, could be put in the Supplement.

      10. In the section on "Prioritizing enriched metapaths for database storage", what is the logic underlying the choice of parameters? "For metapaths with length ≥ 2, we chose an adjusted pvalue threshold of 5 × (nsource × ntarget)^−0.3".

      11. Under "Visual Design", are the colours chosen "colour-blind friendly"?
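
      To make the threshold quoted in the comment on "Prioritizing enriched metapaths for database storage" concrete, the sketch below evaluates 5 × (nsource × ntarget)^−0.3 for a few node counts. The function name `adjusted_p_threshold` and the example counts are hypothetical and chosen only for illustration; they are not taken from Hetionet or from the hetmatpy package.

```python
def adjusted_p_threshold(n_source, n_target, scale=5.0, exponent=-0.3):
    """Evaluate scale * (n_source * n_target) ** exponent, the threshold quoted above."""
    if n_source <= 0 or n_target <= 0:
        raise ValueError("node counts must be positive")
    return scale * (n_source * n_target) ** exponent


# Hypothetical numbers of source and target nodes for three node-type pairs.
for n_source, n_target in [(10, 10), (100, 1000), (1000, 20000)]:
    threshold = adjusted_p_threshold(n_source, n_target)
    print(f"n_source={n_source:>5} n_target={n_target:>6} threshold={threshold:.4g}")
```

      Because the exponent is negative, the threshold tightens as the number of source-target pairs grows. For small products it exceeds 1, in which case it would not filter out any metapath, which may be part of what the question about the choice of parameters is probing.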

    1. Abstract: Scientists employing omics in life science studies face challenges such as the modeling of multi assay studies, recording of all relevant parameters, and managing many samples with their metadata. They must manage many large files that are the results of the assays or subsequent computation. Users with diverse backgrounds, ranging from computational scientists to wet-lab scientists, have dissimilar needs when it comes to data access, with programmatic interfaces being favored by the former and graphical ones by the latter. We introduce SODAR, the system for omics data access and retrieval. SODAR is a software package that addresses these challenges by providing a web-based graphical user interface for managing multi assay studies and describing them using the ISA (Investigation, Study, Assay) data model and the ISA-Tab file format. Data storage is handled using the iRODS data management system, which handles large quantities of files and substantial amounts of data. SODAR also offers programmable APIs and command line access for metadata and file storage. SODAR supports complex omics integration studies and can be easily installed. The software is written in Python 3 and freely available at https://github.com/bihealth/sodar-server under the MIT license. Competing Interest Statement: The authors have declared no competing interest.

      **Reviewer 2. Philippe Rocca-Serra**

      The reviewer thanks the authors for their efforts in producing the submitted manuscript. The authors describe a Django-based web application designed to support data management. The tool is built to support experimental metadata capture using the ISA format in its TSV form. The tool relies on iRODS to manage data files associated with the experimental metadata. The tool offers programmatic access via an API and a clear front end.

      Main comments:

      The title: "SODAR: enabling, modeling, and managing multi-omics integration studies" could be clearer. Being more concise, "SODAR: standard compliant management of multi-omics studies" would deliver a better message.

      Page 1, Abstract: it would benefit from further refinement as there are several repetitions. Check the 3rd sentence for English: "ranging from....to...", s/whereas/to/. Change "Scientists from diverse backgrounds also have different demands for interfacing with the data, ranging from computational users that need programmatic or command line access whereas non-computational users need graphical interfaces." to: "Scientists, with different backgrounds, ranging from computational scientists to wet-lab scientists, have different needs when it comes to data access, with programmatic interfaces being favoured by the former and graphical ones by the latter". Instead of saying "under a permissive licence", be more explicit and plainly state "under MIT licence".

      Page 2, Introduction: What is the difference between "data analysis and integration of data"? Repetition/redundancy in "An example of such complex study is (Esterhuyse et al., 2015) in infection biology, which will be used as an example below."

      Suggestion: Use of the term "modeling": using "plan" or "planning" may be better to remove any ambiguity about the nature of the modelling (statistical modeling, data modeling). Alternatively, prefer 'representation' or 'representing' (the term model is repeated many times in the following sentences).

      The statement "The most comprehensive standard for describing study metadata is the ISA-Tab format ..." is probably too strong. There are more formal (UML) models such as FUGE-OM (https://doi.org/10.1038/nbt1347) or CDISC SDM & SDTM. A more understated assessment such as "a popular standard, owing to its simplicity, is the ISA-Tab format" would be preferable. "Alternatives include...": possibly cite other options for managing such complex datasets as seen with BIDS in neuroscience (Gorgolewski, K., Auer, T., Calhoun, V. et al. The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Sci Data 3, 160044 (2016). https://doi.org/10.1038/sdata.2016.44), or, why not, mention the HDF5 specification. This section could be improved by refining the transitions between the different ideas presented or organising the flow. For example, by laying out the challenges of 1/ dealing with experimental metadata and 2/ dealing with digital objects produced by instruments, which have the characteristics outlined by the authors (volume, depth). Then review the technical solutions, then present the choices made by this implementation, and possibly identify the selection criteria which led to choosing one specification over another.

      Results:

      Page 4: "Non-computational users can interface with SODAR using the graphical UI, whereas computational users can use command line interfaces and REST APIs from scripts and other external software." Repeat from the abstract. I would suggest rephrasing to 'humanise' 'computational users' vs 'non-computational users', and identifying the functions and roles in actual labs (bioinformaticians, data analysts, aka dry lab scientists) vs (experimentalists, wet-lab biologists).

      Figure 1: same comment (in fact confirmed by the choice of characters). A question about the diagram: is it the case that the Web UI does not talk to the server via the API, as is done in some modern development? Probably highlight there the reliance on the Django framework.

      Section 2.1: The first sentence needs attention; check the English: "for both serving for modeling experiments...". Also, there are existing systems (EBI Metabolights tools on their GitHub repo, DataVerse, FAIRdom SEEK, Zendro...). So the storytelling should probably first present a survey of the existing systems and only then bring in the arguments justifying new development.

      Table 1: It is odd to lump together blanket statements for tools such as LIMS, ELN or 'Study Databases' without clearly stating which ones specifically have been evaluated. It seems that one could formulate a table with very different results.

      Question: How was selection bias controlled for?

      Page 5: This section should be reorganised and each explanatory statement refined to add clarity. Case in point: "Arbitrary Experiments": Does experiment equate to 'ISA.Assay'? Is it akin to a workflow or process sequence? Question: among the key features that such a system should have to support the work of dry/wet lab scientists, surely deposition to public repositories should be high on the list. Why is this absent?

      Page 6: Typo: s/bioinfsormaticians/bioinformaticians/. Punctuation to be checked: missing commas make for a difficult read. Suggestion: simplify the role of 'experimentalists' in the context of SODAR: "They use the templates provided by the Data Stewards to instantiate a wet lab track and track its metadata." Question: How are data stewards trained in ISA-Tab? Access to the demo tool gives the opportunity to use and test the component. While the UI is simple and intuitive, a number of limitations in the editing functionality make usage more difficult than it needs to be.

      Page 7: "of course, using the REST-API of SODAR, it is possible to automate these tasks". Could the authors produce a Jupyter notebook showing how to do so? It would be a nice addition and possibly a good resource that could facilitate uptake.

      Section 2-3, pages 8-9-10: This section could be streamlined and condensed to really focus on the interaction between shaping a sample processing & data acquisition workflow into a template which can be used by wet lab scientists, all this while allowing markup with ontology terms. Note: the ontology terms on the demo server do not resolve properly. Question: Why choose BioPortal over other services, e.g. EBI OLS? Question: How can value-sets be constrained in SODAR? Question: ontology browser: it is unclear if the ontologies need to be loaded locally or if they are accessed via an API call to the relevant services. Can the authors clarify this point? The demo server did not seem to allow it, or I wasn't able to. Maybe a figure showing the functionality would help?

      Page 11, Internal Usage Statistics: Question: it seems that the mean size of an experiment stored in SODAR is ~60 samples and about 10 files per sample. These are relatively small studies. Can the authors provide insights about the performance of the platform with large studies (several thousands of samples and above)?

      Methods:

      Question: Installation and deployment of SODAR. Why do the authors omit to mention that SODAR can be deployed via Docker? It seems useful information.

      Question: AltamISA. Checking the library, it seems that development has stalled. Is it a concern? Have the authors tested swapping AltamISA with ISA-API? Is it at all possible? Could it be done via an adaptor of some sort? Can AltamISA convert to ISA-JSON or another public-repository-compatible format to provide a capability to assist users in disseminating their results?

      Comment: Figure 3 should not be supplementary material but proper content, as it is useful for showcasing the SODAR UI and customization.

      Re-review: The reviewer thanks the authors for their efforts and extensive rework of the manuscript, and for delivering this software stack.

      Minor corrections:

      - Page 4, 2nd paragraph, first sentence: typo -> s/approaching itusing/approaching it using/

      - Page 7, 2nd paragraph, suggested edit: change from: "For publication, raw and processed data and metadata are deposited in scientific catalogues, study databases and registries. An example is the BioSamples database for metadata [22]." to: "For publication, metadata and raw or processed data are deposited in scientific catalogues, study databases and registries. Examples are the BioSamples database for metadata [22] and Short Read Archive for raw sequencing data [citation needed]."

      Important clarifications:

      1. This sentence does a disservice to the manuscript: "Our work is representative of the work typically done by core units in clinics. Clinical settings often deal with humans as their primary sample source. This implies controlled access of data, or not being allowed to share confidential data. Thus, developing support for hosting data in a public repository is not our aim. Likewise, uploading data to other public repositories has not been a priority." Two reasons:

      - The first one is that it opens the can of worms of data governance and oversight of patient-related information. I would steer clear of that in this piece.

      - The second one is that I would flip the argument around: "While deposition to public repositories was not necessarily the priority, the development of an (almost, see below) ISA compliant system provides such a capability should the data owner need it".

      2. In the results section, or in the documentation, a welcome addition would be examples of templates for non-sequencing-based assays. For instance, since the authors mentioned their need to support proteomics and mass-spectrometry users, it would make sense to highlight the templates available. In other words, it would help the target audience of the manuscript locate 'metadata profile definitions' (somewhat akin to ISA configurations) for specific assay types. If I have missed it from the manuscript or the GitHub repo, please ignore.

      3. "Dialectic" ISA format: Several examples available from the GitHub repository generally follow the ISA-Tab specifications but also introduce a local field: "Library Name". While such a value would make sense in the official ISA specification, it is currently not supported. This leads to the creation of a diverging format. It would be sensible to keep "Library Name" as a presentation label (for display in the UI) and substitute it with "Labeled Extract Name" when exporting outside the database to the tab format, in order to retain compatibility with other ISA parsers and the official specifications. It could be added as an output option to the AltamISA parser in case deposition to public repositories is needed (e.g. EMBL-Metabolights). This would go some way towards helping 'Interoperability' and would not be too onerous a change. Of note, I was recently made aware that the ENA repository would be accepting submissions in ISA-Tab and ISA-JSON format, hence raising this point with the authors.

      Suggestion: clarify this in the Methods section. Also, it seems the following example is missing the 'Assay Name' and 'Raw Data File' fields: https://raw.githubusercontent.com/bihealth/sodar-paper/main/GSE96583_PBMC_Single-Cell_Demo_Project/a_PBMC_test_scRNAseq_nucleotide_sequencing.txt
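
      The label substitution suggested in the "Library Name" comment could be as small as rewriting one header cell at export time. The sketch below is a minimal illustration under stated assumptions, not AltamISA's or SODAR's actual API: the helper `rename_header_for_export` and the three-column example table are hypothetical, and only Python's standard `csv` module is used.

```python
import csv
import io


def rename_header_for_export(tsv_text, display_label="Library Name",
                             export_label="Labeled Extract Name"):
    """Return tab-separated text with one header label swapped for export."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return tsv_text  # nothing to rename in an empty table

    # Only the header row is changed; data rows keep the same cell values.
    rows[0] = [export_label if name == display_label else name for name in rows[0]]

    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


# Hypothetical assay table, for illustration only.
table = "Sample Name\tLibrary Name\tRaw Data File\nsample1\tlib1\treads_1.fastq.gz\n"
exported = rename_header_for_export(table)
print(exported)
```

      Keeping the display label inside the database and translating it only on export would leave the UI unchanged while the exported file uses the column name that other ISA-Tab parsers expect.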

    2. AbstractScientists employing omics in life science studies face challenges such as the modeling of multi assay studies, recording of all relevant parameters, and managing many samples with their metadata. They must manage many large files that are the results of the assays or subsequent computation. Users with diverse backgrounds, ranging from computational scientists to wet-lab scientists, have dissimilar needs when it comes to data access, with programmatic interfaces being favored by the former and graphical ones by the latter.We introduce SODAR, the system for omics data access and retrieval. SODAR is a software package that addresses these challenges by providing a web-based graphical user interface for managing multi assay studies and describing them using the ISA (Investigation, Study, Assay) data model and the ISA-Tab file format. Data storage is handled using the iRODS data management system, which handles large quantities of files and substantial amounts of data. SODAR also offers programmable APIs and command line access for metadata and file storage.SODAR supports complex omics integration studies and can be easily installed. The software is written in Python 3 and freely available at https://github.com/bihealth/sodar-server under the MIT license.

      This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad052), and the reviews have been published under the same license. These are as follows.

      **Reviewer 1. Xiaotao Shen**

      The authors developed the SODAR tool, which supports multi-omics integration studies. This is a great tool that has a user-friendly interface and supports multi-omics integration. However, I have several concerns that need to be addressed before this manuscript can be considered for publication. How does SODAR handle multi-omics data that come from different samples? For example, gut microbiome data from stool samples and proteomics data from blood samples may be from the same person but collected on different dates. Since SODAR supports cell editing, how does it automatically keep the metadata and the expression data consistent? The authors claim that SODAR can support multi-omics integration studies. However, I could not find out how SODAR does that. Could the authors describe this in more detail?

      Re-review: The authors have addressed all my comments and concerns.

    1. AbstractTransformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep-learning framework for predicting DNA methylation sites, which is based on five popular transformer-based language models. The framework identifies methylation sites for three different types of DNA methylation, namely N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the “pre-train and fine-tune” paradigm. Pre-training is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA-methylation status of each type. The five models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. Mulan-Methyl is open source and we provide a web server that implements the approach.

      **Reviewer 2. Jianxin Wang**

      In this manuscript, the authors present MuLan-Methyl, a deep-learning framework for predicting 6mA, 4mC, and 5hmC sites. They use DNA sequence and taxonomic identity as features, and implement five popular transformer-based language models in MuLan-Methyl. MuLan-Methyl is open-sourced, and a web server is also provided for users to access it. Overall, I think the methodology of MuLan-Methyl is clear and innovative, and the experiments seem comprehensive. However, I do have several concerns that I believe should be addressed before the paper is accepted by GigaScience.

      Major 1. One major concern is that, in my opinion, DNA methylation is dynamic. Cytosines in the same position of the DNA sequence may have different methylation status in different samples, different cells, or even in different development stages of a cell. So, how can we predict the methylation status of a site based on only its sequence (and taxonomic identity)? -- The authors should clarify that in what cases, MuLan-Methyl (as well as other methods that use only DNA sequence) can be used to study DNA methylation, in Introduction or Discussion section. -- The authors discuss motifs in Fig. 3, but only for positive samples. How about the motif distribution in the negative samples? Can I understand that this method is actually for discovering motifs (or sequence structures) that are highly correlated with methylation? -- How is the performance of MuLan-Methyl without taxonomic identity? 2. The authors compared MuLan-Methyl against iDNA-ABF and iDNA-ABT, especially on the independent test set (Fig. 2E). I think the authors should clarify that whether they trained the models of the three methods using the same training datasets. If not, the authors should clarify the reason. 3. I'm curious about the computational efficiency of MuLan-Methyl. How many parameters in its model? Does MuLan-Methyl have advantages over other methods in terms of computational efficiency?

      Minor 1. I don't understand why the references were not ordered from 1 in the main text. 2. I suggest that the authors re-organize the Introduction section. There are too many small paragraphs in this section. 3. At the end of Page 2, "The type 4mC type is present in 4 species" should be corrected.

      Re-review:

      The authors have addressed most of my concerns. However, I still have one minor concern about the computational efficiency. The response of the authors is not convincing by only saying "The number of models that MuLan-Methyl need to train and test on is less than the others, thus it has better computational efficiency than other models to some extent". If possible, I strongly suggest that the authors show some data to compare how much time and resources (GPU/CPU/RAM) needed by each method.

    2. AbstractTransformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep-learning framework for predicting DNA methylation sites, which is based on five popular transformer-based language models. The framework identifies methylation sites for three different types of DNA methylation, namely N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the “pre-train and fine-tune” paradigm. Pre-training is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA-methylation status of each type. The five models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. Mulan-Methyl is open source and we provide a web server that implements the approach.

      This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad054), and the reviews have been published under the same license. These are as follows.

      **Reviewer 1. Yupeng Cun**

      Zeng et al. propose an ensemble framework for identifying three types of DNA methylation sites and perform a benchmark comparison on genomic data from multiple species. The paper is a valuable study of how ensemble transfer learners work and of their predictability in different species. My suggestion is that the manuscript is acceptable after the following minor revisions: 1. Calculate a consensus ranking for each method in Figure 2-C using the Kendall tau rank distance. 2. The multi-head self-attention and self-attention head formulas should be rewritten following this preprint: https://arxiv.org/pdf/1706.03762.pdf 3. MLM and MuLan-Methyl are mixed up in some cases; the terms need to be used consistently.
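      The consensus ranking requested in point 1 can be illustrated with a small sketch. It is only an assumption about what such an analysis might look like, not the authors' procedure: it picks the ordering that minimises the summed Kendall tau distance to all input rankings by brute force, which is only practical for a handful of methods. The method names and rankings are made up.

```python
from itertools import combinations, permutations


def kendall_tau_distance(rank_a, rank_b):
    """Count item pairs that the two rankings order differently."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    return sum(
        1
        for x, y in combinations(rank_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def consensus_ranking(rankings):
    """Brute-force the ordering with minimal total Kendall tau distance."""
    items = sorted(rankings[0])
    return min(
        permutations(items),
        key=lambda cand: sum(kendall_tau_distance(cand, r) for r in rankings),
    )


# Hypothetical per-dataset rankings (best first) of three made-up methods.
rankings = [
    ["method_A", "method_B", "method_C"],
    ["method_A", "method_C", "method_B"],
    ["method_B", "method_A", "method_C"],
]
print(consensus_ranking(rankings))
```

      For more than about eight methods the factorial search becomes slow, and a heuristic such as ordering by mean rank would be a more realistic starting point.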

    1. AbstractBackground The domesticated turkey (Meleagris gallopavo) is a species of significant agricultural importance and is the second largest contributor, behind broiler chickens, to world poultry meat production. The previous genome is of draft quality and partly based on the chicken (Gallus gallus) genome. A high-quality reference genome of Meleagris gallopavo is essential for turkey genomics and genetics research and the breeding industry.Results By adopting the trio-binning approach, we were able to assemble a high-quality chromosome-level F1 assembly and two parental haplotype assemblies, leveraging long-read technologies and genomewide chromatin interaction data (Hi-C). These assemblies cover 35 chromosomes in a single scaffold and show improved genome completeness and continuity. The three assemblies are of higher quality than the previous draft quality assembly and comparable to the current chicken assemblies (GRCg6a and GRCg7). Comparative analyses reveal a large inversion of around 19 Mbp on the Z chromosome not found in other Galliformes. Structural variation between the parent haplotypes were identified in genes involved in growth providing new target genes for breeding.Conclusions Collectively, we present a new high quality chromosome level turkey genome, which will significantly contribute to turkey and avian genomics research and benefit the turkey breeding industry.

      **Reviewer 2. Luohao Xu**

      This manuscript by Barros et al. presents a high-quality diploid turkey genome assembly which shows significant improvement relative to the previous one. This new assembly is timely and will likely be used as the reference turkey genome, but the authors should acknowledge that the W chromosome is absent (because the F1 individual was a male?). This manuscript fits more with "Data Note" than "Research" as I see most results are descriptive and confirmatory. While the chromosomal assembly is relatively complete, I am concerned whether it still contains assembly errors (because of not being polished by long reads?) which led to fewer genes being annotated. This assembly metric needs to be taken into account if this assembly is to be used as a reference. The authors need to provide the QV value (see the VGP standard), and evaluate indel errors in coding regions. Some of the results are very brief, without details or a figure, and are therefore difficult to assess, for instance those SVs affecting genes. Page 4, "two most important avian agricultural species", I think duck should be the second most important poultry species? Page 5, I believe the "F1 assembly" refers to the primary assembly or collapsed assembly - please define it more clearly. Page 6, it's unclear how the 36 chromosome models are defined, particularly for small microchromosomes (29-35). According to the karyotype of turkey (2n=80), a few chromosomal models are missing. Page 6, "This captures the chromosome arms in a single contig" does it apply to all chromosomes? This is unlikely, and data is not shown. Page 6, any idea why the coverage of two parents differs (110X vs. 137X)? Page 6, "anchored the assemblies to the F1 assembly using RagTag". This suggests that the chromosomal assembly of the two haplotypes was not independent and relied on the F1 assembly. This can potentially lead to missing structural variations between two haplotypes (inversions, translocations). 
Page 7, please show more data to support the correct assembly of the chrZ inversion, including Hi-C heatmap, and long-read alignment spanning the inversion breakpoints. Note the Z chromosome inversion has been reported in Zhang et al. 2011 (BMC genomics), which is not cited until the Discussion. Page 8, it's possible some genes were not annotated because of the presence of indels in coding regions. The genome assembly QV value can be calculated to measure the error frequency (Rhie et al, 2021 Nature). Page 8, please provide a statistical result for gene density comparison. Page 8, at the bottom, please cite the sources of these bird genomes. Page 9, "Gene family contractions and expansions". These analyses were a bit crude. "Orthologous groups" is not equivalent to "gene family". Page 10, the phrase "F1 and parent assemblies" is confusing. Both haploid assemblies are derived from the diploid F1. Consider changing to "paternal and maternal genomes". Also, as I commented above, both parental chromosomal assemblies are based on the same reference (Mgal_WU_HG_1.0), so the contigs were ordered and placed in the same way. This process could mask the potential non-co-linear segments. For a more suitable way to independently assemble two chromosome-level assemblies, see the marmoset diploid genome paper (Yang et al., 2021 Nature). Page 10, please use a figure to show the SV over the BLB2 gene. Page 11, again, please visualize the result on the MAN2B2, GEMIN8, RIMKLB and RALYL cases. Page 11, "Loss of function variation", I am wondering whether variations mentioned in this part are fixed in the corresponding populations? Page 11, "Knockouts of this gene lead.." reference is needed. Page 12, "Avian genomes are known to…" references are missing. Page 12, "Distinct genomic landscapes of turkey micro and macrochromosomes", some patterns have been described in the literature, for instance, 10.1111/nyas.13295. 
Please also perform some statistical analyses to support the claims, not just a figure. Page 13, "Conserved synteny within the Galliformes clade", please cite 10.1159/000078570 and 10.1007/s00412-018-0685-6. Page 13, "it is evident that especially the Z chromosome" also observed in 10.1038/s41559-019-0850-1. Page 13, "inversion of around 19 Mbp on the turkey Z" also reported in 10.1186/1471-2164-12-447. Page 14, "tail of the chicken Z chromosome lacks synteny" also reported in 10.1038/nature09172. This means figure S11 does not provide a novel finding. Page 14, "Combining long reads and genome-wide chromatin interaction data (Hi-C) enables the capture of chromosome arms in a single contig", again, is that correct, chromosome arms in a single contig? Page 18, it's known that wtdbg2 assemblies tend to contain errors, but it looks like the authors did not use long reads for polishing, but only used short reads? Page 20, "The corrected reads from TrioCanu were mapped to the Triocanu assembly with Minimap2 v2.17-r941 (Minimap2, RRID:SCR_018550) [45], options -x map-pb", what was this used for? Page 20, "Duplicated sequences were removed." How was this done?
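      The QV value requested in this review is a Phred-scaled error rate. The sketch below shows the arithmetic only and is not the exact procedure of any particular tool: `phred_qv` converts a per-base error rate to QV, and `kmer_based_error_rate` assumes the common k-mer argument that a k-mer is error-free only if all k of its bases are correct. The function names and the k-mer counts are hypothetical.

```python
import math


def phred_qv(error_rate):
    """Phred-scaled consensus quality: QV = -10 * log10(per-base error rate)."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


def kmer_based_error_rate(asm_only_kmers, total_asm_kmers, k):
    """Estimate the per-base error rate from k-mers found only in the assembly.

    The fraction of assembly k-mers that are also supported by the reads is
    treated as (1 - error_rate) ** k and solved for error_rate.
    """
    shared_fraction = 1 - asm_only_kmers / total_asm_kmers
    return 1 - shared_fraction ** (1 / k)


# Made-up counts: 2,100 of 1,000,000 assembly 21-mers lack read support.
rate = kmer_based_error_rate(2_100, 1_000_000, k=21)
print(f"estimated error rate = {rate:.2e}, QV = {phred_qv(rate):.1f}")
```

      On this scale QV 40 corresponds to one error per 10,000 bases and QV 50 to one per 100,000, which is why the value is a convenient single-number summary of base-level accuracy.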

      Re-review The manuscript has been improved. After reading the revised manuscript, I have a few more concerns.

      Chromosome models. I suggest the chromosome naming should follow chicken's, e.g., chr6 can be chr2a, and the microchromosomes should be named according to chicken homology. I then noticed chr32 and chr35 do not have chicken homology, which is very concerning. It is either due to novel chromosomes (very unlikely), or the sequences could be unlinked contigs. In either scenario, the chromosome models must be clarified. The authors should provide strong evidence to support the chromosome model assembly for chr32 and chr35, e.g. FISH images, Hi-C zoom-in view (Fig. S1 shows the whole genomes where the microchromosome models are not visible), synteny with chicken (note there is a new chicken assembly ASM2420605v1) or zebra finch chromosomes; otherwise, chr32 and chr35 cannot be identified as chromosomes. Centromere and telomere. To support complete chromosome assembly, I suggest the authors provide information about the assembly of telomere and centromere sequences, e.g. the presence/absence of TTAGGG at chromosomal ends. Most galliformes microchromosome centromeres are known to contain a 41-bp satellite (10.1139/gen-2022-0012). The authors should investigate whether such centromere satellites are present in the assembly. Data availability. It appears the Hi-C data is not available in NCBI. The raw reads must be provided. In the abstract, there is no such term as "complete scaffold", please remove "complete". Again, I do not see the support for two chromosome models: chr32 and chr35. The chrZ inversion is highlighted in the abstract, but this is not a novel finding - the writing is thus misleading. Instead, the new genome assembly only CONFIRMS this inversion. The subtitle "Lineage specific expansion and contraction of protein-coding gene families" is unrelated to the following text. "a 1.47 Mbp inversion on chromosome 1" I am wondering if this is the centromere? According to chicken chr1 centromere position, it looks like so. 
In Table 5, Parent2 has a much larger size of gained copies. Please show more details, e.g. chromosomal distribution. "BLB2", is this gene associated with a parent2-specific trait? Similarly, what about TRIM36, GRIA2 and MAN2B2, and LRRC41? "The inversion was supported by a normal alignment at the approximate breakpoints (Supplementary File 1: Table S7 - Figure S16) and by the HiC contact map". The writing here is unclear. Hi-C data does not show a signal for the inversion; instead, it only supports that the assembly is correct. Bellott et al 2020 should be Bellott et al 2017. "Centromeres, however, are too long to traverse reliably in most cases". I do not see any analyses on centromeres. PRJEB42643 does not contain Hi-C data.
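      The check for TTAGGG at chromosomal ends suggested in this re-review could be approached along these lines. This is a rough sketch under stated assumptions, not a validated telomere caller: the function name, the 1 kb default window, and the requirement of at least three tandem copies are all arbitrary choices made for illustration, and the test sequence is synthetic.

```python
import re

# Vertebrate telomeric repeat and its reverse complement.
TELOMERE_MOTIFS = ("TTAGGG", "CCCTAA")


def telomere_repeat_counts(sequence, window=1000, min_tandem=3):
    """Count telomeric motif copies found in tandem runs in each terminal window."""
    seq = sequence.upper()
    ends = {"start": seq[:window], "end": seq[-window:]}
    counts = {}
    for name, chunk in ends.items():
        total = 0
        for motif in TELOMERE_MOTIFS:
            # Match runs of at least `min_tandem` consecutive motif copies.
            pattern = "(?:%s){%d,}" % (motif, min_tandem)
            for hit in re.finditer(pattern, chunk):
                total += len(hit.group()) // len(motif)
        counts[name] = total
    return counts


# Synthetic scaffold: repeats at both ends, unrelated filler in the middle.
toy = "CCCTAA" * 5 + "ACGT" * 50 + "TTAGGG" * 4
print(telomere_repeat_counts(toy, window=60))
```

      A scaffold with a count of zero at one or both ends would be a candidate for an incomplete chromosome model, although real telomeres contain degenerate repeats that an exact-match search like this will miss.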

      Re-re-review A new chicken genome has been published during the revision: https://www.pnas.org/doi/10.1073/pnas.2216641120, I suggest the authors revise some parts of the manuscript: e.g. L66, L78, L83-85 L103, please make it clear only the F1 was sequenced with long-read. L117-142, those results are very interesting, but perhaps the language can be more concise. L231-236, this paragraph is not important, please either move them to supplementary material or remove them. In general, this manuscript can be much more streamlined. L310-315, this part has also been reported by Huang et al. 2023 PNAS, so this is not a novel finding. Please either streamline or remove it. L327, ref 36 is not a "recent" finding.

    2. AbstractBackground The domesticated turkey (Meleagris gallopavo) is a species of significant agricultural importance and is the second largest contributor, behind broiler chickens, to world poultry meat production. The previous genome is of draft quality and partly based on the chicken (Gallus gallus) genome. A high-quality reference genome of Meleagris gallopavo is essential for turkey genomics and genetics research and the breeding industry.Results By adopting the trio-binning approach, we were able to assemble a high-quality chromosome-level F1 assembly and two parental haplotype assemblies, leveraging long-read technologies and genomewide chromatin interaction data (Hi-C). These assemblies cover 35 chromosomes in a single scaffold and show improved genome completeness and continuity. The three assemblies are of higher quality than the previous draft quality assembly and comparable to the current chicken assemblies (GRCg6a and GRCg7). Comparative analyses reveal a large inversion of around 19 Mbp on the Z chromosome not found in other Galliformes. Structural variation between the parent haplotypes were identified in genes involved in growth providing new target genes for breeding.Conclusions Collectively, we present a new high quality chromosome level turkey genome, which will significantly contribute to turkey and avian genomics research and benefit the turkey breeding industry.

      This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad051), and the reviews have been published under the same license. These are as follows.

      **Reviewer 1. Yunyun Lv**

      Reviewer Comments to Author: The turkey is important for agriculture as it is the second largest contributor to world poultry meat production. This study completes a chromosome-scale genome assembly with long-read sequencing and uses the trio-binning approach to generate a haplotype-resolved turkey genome, which is of scientific significance for further genetic studies within this species. However, I feel the content of this article needs improvement. Some parts were unclear and hard to follow; I list some of them below. After substantial revisions, I will suggest publication.

      In abstract: The sentence "These assemblies cover 35 chromosomes in a single scaffold and show improved genome completeness and continuity" seems weird and hard to understand directly. Please revise it and make it clear. "The three assemblies are of higher quality than the previous draft quality assembly and comparable to the current chicken assemblies (GRCg6a and GRCg7)." Please indicate the parameters used for comparison clearly and how prove them with a higher quality. "Structural variation between the parent haplotypes were identified in genes involved in growth providing new target genes for breeding." The theoretical context of this sentence is not clear, so I suggest more information added to make it clear.

      Considering no statistic in the conclusion, I suggest the conclusion sentence can be revised as "we contribute a new high-quality turkey genome at chromosome-level, benefiting turkey genetics and other avian genomics research as well as turkey breeding industry."

      In the introduction: "Most of the chromosomes are small microchromosomes, while only a few macrochromosomes are present in the karyotype." Please clearly indicate how many microchromosomes there are in turkey and chicken. "Most of" is uninformative for readers. "and by current standards would be considered of draft quality". What are the current standards? Please indicate them clearly. "Ongoing efforts in producing high quality assemblies of the microchromosomes in avian genomes have been unsuccessful due to multiple causes" What do the multiple causes stand for? Or do the features of microchromosomes lead to the unsuccessful assembly mentioned above? "For instance, improved annotation of (non)-coding genes benefits the functional interpretation of genome wide association studies (GWAS), and aids in identifying targets for gene editing", why do non-coding genes (I understand that non-coding genes refers to regulatory regions, but actually, they are not real genes) benefit …? Why can protein-coding genes (structural genes) not undertake these roles? "The genome assemblies of turkey (this paper) and chicken, however, are of considerably higher quality compared to other Galliforme species. This provides opportunities for an in-depth comparison between the two most important avian agricultural species." I cannot follow the logic of why this sentence is placed here. Obviously, it should be part of the discussion after the comparison of the turkey genome with other avian genomes. "In this study we use a relatively new technique, the trio-binning approach, to construct high quality haplotype-resolved turkey assemblies." I feel it is necessary to give an explanation of the term "trio-binning approach", as many readers will not understand what it stands for. The long-read sequencing technology used within it also connects closely to the preceding theoretical context.

      In results: Have you used other assemblers to complete the genome assembly? Such as flye, or nextdenovo, or mecat2 that may have better performance. Have you ever tried 3D-dna for chromosome-scale assembly? which may be better as my experience. The gene annotation should be assessed by BUSCOs.

      In discussion: "The quality of the assemblies presented in this study confirms the value of this method in not only providing a quality assembly but also in uncovering structural genomic variation." Please indicate which quality index that reflect your genomic assembly. "Thanks to these recent sequencing technologies, we are able to correct a number of wrongly oriented contigs in Turkey_5.1, a phenomenon often observed in short-read based assemblies." I feel this sentence is not formal in writing.

      Re-review: The author has carefully amended the work in response to my prior concerns, and the quality of the new version has greatly improved, hence it is suggested that the manuscript be accepted.

  13. Jul 2023
    1. Editor’s Assessment

      This work has generated metabolic models for the human pathogens Mycobacterium leprae and Mycobacteroides abscessus, alongside a new computational tool that can be used to identify potential drug targets. The standardised genomic scale metabolic models have been developed using the systems biology community standards for quality control and evaluation of models. After providing more detail on reproducibility, comparative performance of the models, and reuse, these resources are now published and are available for reuse by the global scientific community via the GigaDB, Biomodels, and PatMeDB repositories.

      This assessment refers to version 1 of this preprint.

    1. Background Hands-on training, whether it is in Bioinformatics or other scientific domains, requires significant resources and knowledge to setup and run. Trainers must have access to infrastructure that can support the sudden spike in usage, with classes of 30 or more trainees simultaneously running resource intensive tools. For efficient classes, the jobs must run quickly, without queuing delays, lest they disrupt the timetable set out for the class. Often times this is achieved via running on a private server where there is no contention for the queue, and therefore no or minimal waiting time. However, this requires the teacher or trainer to have the technical knowledge to manage compute infrastructure, in addition to their didactic responsibilities. This presents significant burdens to potential training events, in terms of infrastructure cost, person-hours of preparation, technical knowledge, and available staff to manage such events.Findings Galaxy Europe has developed Training Infrastructure as a Service (TIaaS) which we provide to the scientific community as a service built on top of the Galaxy Platform. Training event organisers request a training and Galaxy administrators can allocate private queues specifically for the training. Trainees are transparently placed in a private queue where their jobs run without delay. Trainers access the dashboard of the TIaaS Service and can remotely follow the progress of their trainees without in-person interactions.Conclusions TIaaS on Galaxy Europe provides reusable and fast infrastructure for Galaxy training. The instructor dashboard provides visibility into class progress, making in-person trainings more efficient and remote training possible. In the past 24 months, > 110 trainings with over 3000 trainees have used this infrastructure for training, across scientific domains, all enjoying the accessibility and reproducibility of Galaxy for training the next generation of bioinformaticians. 
TIaaS itself is an extension to Galaxy which can be deployed by any Galaxy administrator to provide similar benefits for their users. https://galaxyproject.eu/tiaas

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad048), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Elizabeth Ryder

      This technical note is an informative explanation of Training-Infrastructure-as-a-Service, which is a free service available to facilitate Galaxy training sessions. The service provides an easy way for instructors to set up infrastructure for trainings, enables learners to make progress through the training without long waiting times, and includes a dashboard through which instructors can easily monitor progress of learners. The article provides data showing the large number of events and locations that have benefited from using TIaaS. Because of the utility and general applicability of TIaaS, the article will be of interest to the readers of GigaScience.

      Minor suggestions:

      In the Development section: As a practical matter, it would be useful to know the typical timeline for approval of a training session. Also, can anyone who uses Galaxy become an instructor and request this service?

      In the Usage section, there is a sentence that reads, 'Class sizes have ranged considerably, from the median of 25 participants (std. dev 121) to a maximum of 1500 registrants for a fully asynchronous (self-paced) course.' It's a little unusual to talk about a median and standard deviation, since medians are non-parametric measures and SDs are parametric and measured with respect to the mean. I'd suggest using the median and interquartile range instead. I think a histogram of class size distribution would be informative, similar to the event distributions in Fig. 4.

      Grammatical / spelling errors:

      I'm not sure why 'Findings' appears before 'Background' - perhaps an editing error?

      p. 2: 'a limiting factor for events with large number of participants,' should read 'with a large number of participants'

      p. 2: 'by it's design' should read 'by its design'

      p. 2: 'which to to preference' should read 'which to preference'

      p. 4: 'univeristy' should read 'university'

      p. 5: This sentence is hard to scan as written; I think it needs a semi-colon after 'cluster' to make sense.

      Galaxy Europe uses it with HTCondor, and job rules that allow spill over to the main cluster, new machines are brought up in an OpenStack cluster specifically for training events and destroyed afterwards.
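      The reviewer's statistical point in the Usage section, that a median pairs more naturally with an interquartile range than with a standard deviation, can be seen in a short sketch. The class sizes below are invented for illustration and are not the TIaaS usage data; the single very large course stands in for the 1500-registrant event mentioned in the review.

```python
import statistics

# Hypothetical class sizes, including one very large self-paced course.
class_sizes = [10, 15, 20, 25, 25, 30, 40, 60, 1500]

median = statistics.median(class_sizes)
# Quartiles computed with the sample treated as the full population of events.
q1, _, q3 = statistics.quantiles(class_sizes, n=4, method="inclusive")
stdev = statistics.stdev(class_sizes)

print(f"median = {median}, IQR = {q1}-{q3}")
print(f"standard deviation = {stdev:.0f}")
```

      With these numbers the quartiles stay close to the typical class size, while the standard deviation is dominated by the one outlier and ends up far larger than the median itself.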

    2. Background Hands-on training, whether it is in Bioinformatics or other scientific domains, requires significant resources and knowledge to setup and run. Trainers must have access to infrastructure that can support the sudden spike in usage, with classes of 30 or more trainees simultaneously running resource intensive tools. For efficient classes, the jobs must run quickly, without queuing delays, lest they disrupt the timetable set out for the class. Often times this is achieved via running on a private server where there is no contention for the queue, and therefore no or minimal waiting time. However, this requires the teacher or trainer to have the technical knowledge to manage compute infrastructure, in addition to their didactic responsibilities. This presents significant burdens to potential training events, in terms of infrastructure cost, person-hours of preparation, technical knowledge, and available staff to manage such events.Findings Galaxy Europe has developed Training Infrastructure as a Service (TIaaS) which we provide to the scientific community as a service built on top of the Galaxy Platform. Training event organisers request a training and Galaxy administrators can allocate private queues specifically for the training. Trainees are transparently placed in a private queue where their jobs run without delay. Trainers access the dashboard of the TIaaS Service and can remotely follow the progress of their trainees without in-person interactions.Conclusions TIaaS on Galaxy Europe provides reusable and fast infrastructure for Galaxy training. The instructor dashboard provides visibility into class progress, making in-person trainings more efficient and remote training possible. In the past 24 months, > 110 trainings with over 3000 trainees have used this infrastructure for training, across scientific domains, all enjoying the accessibility and reproducibility of Galaxy for training the next generation of bioinformaticians. 
TIaaS itself is an extension to Galaxy which can be deployed by any Galaxy administrator to provide similar benefits for their users. https://galaxyproject.eu/tiaas

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad048), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      **Azza Ahmed**

      The paper is well written and neatly reports on the development of Training Infrastructure as a Service (TIaaS), a free infrastructure resource originally developed by Galaxy Europe and the Gallantries project together with the Galaxy community. TIaaS is a step towards democratizing bioinformatics training, where infrastructure can be a major barrier, even in advanced and well-developed countries. I especially appreciate the value of this resource for instructors and students in low- and middle-income countries, where infrastructure limitations may be exacerbated by the limited availability of well-trained system administrators able to cater to specific training needs. It was indeed gratifying to see training events using TIaaS in such countries in the Figure 3 map, especially as it is not clear that TIaaS is deployed in such countries. The utility of the resource is self-evident: 438 training events in 48 months targeting > 19000 students. Thus, overall, I congratulate the authors for the success of their project, and the community for having such a great free resource at their disposal.

    1. Background Polygenic risk score (PRS) analyses are now routinely applied in biomedical research, with great hope that they will aid in our understanding of disease aetiology and contribute to personalized medicine. The continued growth of multi-cohort genome-wide association studies (GWASs) and large-scale biobank projects has provided researchers with a wealth of GWAS summary statistics and individual-level data suitable for performing PRS analyses. However, as the size of these studies increases, the risk of inter-cohort sample overlap and close relatedness increases. Ideally sample overlap would be identified and removed directly, but this is typically not possible due to privacy laws or consent agreements. This sample overlap, whether known or not, is a major problem in PRS analyses because it can lead to inflation of type 1 error and, thus, erroneous conclusions in published work. Results Here, for the first time, we report the scale of the sample overlap problem for PRS analyses by generating known sample overlap across sub-samples of the UK Biobank data, which we then use to produce GWAS and target data to mimic the effects of inter-cohort sample overlap. We demonstrate that inter-cohort overlap results in a significant and often substantial inflation in the observed PRS-trait association, coefficient of determination (R²) and false-positive rate. This inflation can be high even when the absolute number of overlapping individuals is small if this makes up a notable fraction of the target sample. We develop and introduce EraSOR (Erase Sample Overlap and Relatedness), software for adjusting inflation in PRS prediction and association statistics in the presence of sample overlap or close relatedness between the GWAS and target samples. 
A key component of the EraSOR approach is inference of the degree of sample overlap from the intercept of a bivariate LD score regression applied to the GWAS and target data, making it powered in settings where both have sample sizes over 1,000 individuals. Through extensive benchmarking using UK Biobank and HapGen2 simulated genotype-phenotype data, we demonstrate that PRSs calculated using EraSOR-adjusted GWAS summary statistics are robust to inter-cohort overlap in a wide range of realistic scenarios and are even robust to high levels of residual genetic and environmental stratification. Conclusion The results of all PRS analyses for which sample overlap cannot be definitively ruled out should be considered with caution given high type 1 error observed in the presence of even low overlap between base and target cohorts. Given the strong performance of EraSOR in eliminating inflation caused by sample overlap in PRS studies with large (>5k) target samples, we recommend that EraSOR be used in all future such PRS studies to mitigate the potential effects of inter-cohort overlap and close relatedness.
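
For orientation, the intercept-based inference described in the abstract can be sketched with the standard cross-trait LD score regression expectation of Bulik-Sullivan et al. (2015). This is the general published form and not necessarily the exact expression derived in the EraSOR manuscript. The symbols are assumptions introduced here: $z_{1j}$ and $z_{2j}$ are the association z-scores of SNP $j$ in the GWAS and target data, $N_1$ and $N_2$ the two sample sizes, $N_c$ the number of shared samples, $M$ the number of SNPs, $\ell_j$ the LD score of SNP $j$, $\rho_g$ the genetic covariance, and $\rho$ the phenotypic correlation among the shared samples.

```latex
\mathbb{E}\left[ z_{1j}\, z_{2j} \right]
  = \frac{\sqrt{N_1 N_2}\;\rho_g}{M}\,\ell_j
  + \frac{\rho\, N_c}{\sqrt{N_1 N_2}}
```

The second term is the regression intercept. It is non-zero only when samples overlap ($N_c > 0$), which is why an estimated intercept carries information about $N_c / \sqrt{N_1 N_2}$.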

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad043), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Samuel Lambert (revision 2)

      I commend the authors for doing these extra analyses focused on more real-world applications of the method and adding them to the paper. I think the discussion is better contextualised and my final recommendation is that these warnings/caveats are placed in the software documentation as well (https://choishingwan.gitlab.io/EraSOR/).

    2.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad043), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Samuel Lambert (revision 1)

      The revised manuscript is much clearer and better illustrates when and how to use the EraSOR method. However, I still think important analyses reflecting more common use cases are missing:

      - Use of EraSOR with multi-ancestry summary statistics.
      - Use of EraSOR-corrected summary statistics with other PGS-derivation methods (e.g. LDpred or PRS-CS).
      - Results of a real sensitivity analysis for sample overlap. I understand that you won't know the true overlap in UKB, but the difference in performance between the adjusted and unadjusted summary statistics in the presence of known overlap would be illustrative.

      Adding these analyses to the real UKB section would greatly benefit the manuscript and the utility of the method. Apart from that, I note that, related to line 19, the impact of sample overlap was also outlined as a pitfall by Wray et al., Nat Rev Genet (2013, PMID:23774735).

    3.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad043), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Samuel Lambert

      In this paper Choi et al. describe EraSOR, a new tool to remove the effects of sample overlap between a set of summary statistics and a target dataset. EraSOR works by running a GWAS in the target dataset and then using LD-score regression techniques to estimate the heritability, genetic correlations of the phenotypes, and number of overlapping samples to decorrelate the effect sizes. The method is thoroughly described, and the simulation scenarios are relevant and well-motivated. However, the manuscript could better describe the inputs and characteristics of the decorrelated summary statistics, focusing more on the degree of bias in effect sizes rather than p-value inflation, and the practicalities of how the tool may be used.

      Specific Comments:

      The results of Figure 1/Supp Figure 1 are highly motivating, but the p-value of the association doesn't seem like the perfect measure of inflation. Plots of the effect size of the PRS compared to its expected effect (0, based on heritability) would better illustrate this.

      The paper proposes a method to remove the effects of sample overlap on summary statistics, but instead mostly focuses on how overlap biases the results of PRS prediction. Additional exploration of the decorrelated summary statistics themselves is needed to illustrate the validity of the method. Specifically, how different are the EraSOR-adjusted summary statistics from the true summary statistics measured without sample overlap (e.g. distribution of effect size differences)? For what types of variants does EraSOR fail or overcorrect (e.g. MAF differences between the summary statistics and the target cohort)? Are the results used as-is in other analyses, or do they have to be filtered in some way?

      The PRS analyses in the paper all use PRSice to perform clumping+thresholding, selecting the best p-value and LD thresholds on the target datasets. This could be considered overfitting to the target data, and other derivation methods that do not require a sample to optimize hyperparameters (e.g. PRS-CS, LDpred-auto) could be used. It would be good to provide some additional analyses showing that EraSOR outputs also work with other methods of PRS derivation, and that the results are not sensitive to overfitting through hyperparameter optimization.

      The PRS analysis of the real phenotype data in UKB should be expanded. Currently the analysis uses summary statistics derived in UKB with varying levels of overlap; however, this does not match the real scenario that EraSOR will likely be used in (adjusting an externally sourced GWAS with EraSOR and applying it to UK Biobank). The authors should perform a descriptive analysis to show that EraSOR is useful in this real-world scenario by downloading summary statistics from the GWAS Catalog (with and without inclusion of UK Biobank), applying EraSOR, and quantifying the difference in accuracy (r2) and effect size. On a related note: does the ancestry of the summary statistics have to perfectly match the target cohort? How well does EraSOR work with multi-ancestry summary statistics where the LD panel might be mismatched?

      The point about insufficient adjustment the authors raise on lines 336-42 is quite important. Proper signposting about the limits of the decorrelation is needed in the software description and the discussion. From this passage, do the authors suggest that known sample overlap should be avoided and that EraSOR should only be used as a sensitivity analysis to ensure that overlap does not exist? It would be useful to get the authors' perspective on whether the evaluation of a PRS in a cohort derived using EraSOR-adjusted summary statistics can be seen as truly external to the source GWAS.

      The paper should be accompanied by a more detailed user guide and some test data for the EraSOR tool. Are there any diagnostic plots that are produced that could be used to inspect the data quality?
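
The reviewer's concern about selecting the best threshold on the target data can be illustrated with a minimal sketch. It does not use PRSice or EraSOR. It assumes, purely for illustration, that each candidate threshold yields an independent noise score with no true association to the phenotype (real clumping+thresholding scores are nested and correlated, so the effect there would be smaller). The function name `best_of_k_r2` and all parameter values are hypothetical.

```python
import numpy as np

def best_of_k_r2(n_samples, n_thresholds, rng):
    """Return (mean R^2, best R^2) across null scores unrelated to the phenotype."""
    phenotype = rng.standard_normal(n_samples)
    # Each column stands in for a PRS built at one p-value threshold; all are pure noise.
    scores = rng.standard_normal((n_samples, n_thresholds))
    r2 = np.array([
        np.corrcoef(scores[:, k], phenotype)[0, 1] ** 2
        for k in range(n_thresholds)
    ])
    return float(r2.mean()), float(r2.max())

rng = np.random.default_rng(0)
# Repeat the experiment to average out sampling noise.
results = np.array([best_of_k_r2(500, 20, rng) for _ in range(200)])
mean_r2, best_r2 = results.mean(axis=0)

print(f"average R^2 at a fixed threshold:        {mean_r2:.4f}")
print(f"average R^2 after picking the best one:  {best_r2:.4f}")
```

Even with no signal at all, the "best" threshold shows a larger R^2 than a pre-specified one, which is why methods that avoid tuning on the target sample are a useful robustness check.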

    4.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad043), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      **Jack Pattee** (revision 1)

      Thank you for your detailed responses; I have no further comments.

    5.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad043), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      **Jack Pattee**

      Overall, I think that this manuscript is strong and describes a well-formulated method to address a relevant problem. There are a few outstanding questions about the performance of the EraSOR method from my perspective, which I'll detail as follows.

      My understanding of reference [16] indicates that equation (3) of this manuscript only holds for null SNPs, i.e. if SNP g is not associated with the outcome Y. If this is the case, then this should be discussed in the manuscript. I wonder if this can partially explain the 'under-estimation' behavior we see in the application to real data in Supplementary Figure 3. In particular, I am referencing the behavior where the EraSOR correction will under-estimate the predictive accuracy of the PRS in the target data, i.e. where delta-R^2 is negative. This behavior is not seen in the simulation and warrants further investigation and discussion. While the bias appears small, for some cases delta-R^2 approaches -0.025, which corresponds to an under-estimation of Pearson's r by roughly 0.15; this is substantial. Could it be the case that, for highly polygenic traits such as height and BMI, the null-SNP assumption is unreliable and the performance of EraSOR is degraded? Does a fundamental assumption of sparse genetic association underlie EraSOR?

      I recommend that the real data application play a larger role in the manuscript narrative and be moved out of the supplementary. The simulations are appreciated and helpful, but there is nuance in the analysis of real data that cannot be replicated in simulation.

      I believe the reference to "Supplementary Figure 2" on line 346 should actually be "Supplementary Figure 3". I believe that the axis labels in Supp Figure 3 are flipped.

      Lines 82 and 83 reference genetic stratification and subpopulations; I think the relevance of these concepts should be introduced more clearly and they should be defined in this context. EraSOR concerns the overestimation of predictive accuracy and association incurred by sample overlap between the base and target GWASs; to this reader, it's not clear what this central issue has to do with population stratification. I realize that the derivation of the LD score method is motivated heavily by correcting for stratification; however, these concepts should be introduced more clearly in this manuscript.

      Line 88: consider defining LD score l_j.

      Lines 94-96: consider outlining the mathematical consequence of the assumption that "the two outcomes and cohorts are identical." It's the case that N_1 = N_2 = N_c = N, correct?

      Line 109 / equation (11): My understanding is that the relevant quantity of this derivation is N_c / sqrt(N_1 N_2), which allows us to define the correct matrix C in expression (4). If this is the case, perhaps the quantity of interest should be moved to the LHS of the equation in the final line of the expression, for clarity.

      As discussed in the manuscript, the estimated heritability is in the denominator of the expression for N_c / sqrt(N_1 N_2). The authors correctly discuss that the method should not be applied when there is doubt as to whether the heritability is different from zero. I would take this a step further; in cases where the heritability is zero, we cannot meaningfully apply the EraSOR correction, and thus I am not sure of the utility of the 'type I error' simulations in the manuscript. Perhaps an explicit test for h^2 > 0 should be worked into the EraSOR workflow?

      Line 148 / expression (12): If beta has a normal distribution here, it is the case that all SNPs in the simulation are associated with the outcome Y. This is a somewhat unusual choice for the distribution of SNP effects in a simulation; other applications such as LDpred (Vilhjalmsson et al., AJHG 2015) and LassoSum (TSH Mak et al., Genetic Epi 2017) use a point-normal distribution for simulated SNP effects, which effectively simulates the sparsity frequently observed in nature. Is there a reference or justification for the non-sparse simulation structure here?

      Line 215: there may be a typo in the expression for the variance of the residual term. Is it the case that the variance of the residual depends on the variance of a covariance term? If so, I am confused as to the derivation.

      Line 241: 'triat' should be 'trait'.

      The simulation results in this paper are based on clumping and thresholding for PRS, which does not estimate joint SNP effects, i.e. account for LD. Methods such as LDpred and LassoSum do so. Is there any reason to believe the results would be different for a method such as LassoSum?

      I am confused by the very low Fst between the simulated Finnish and Yoruban samples in simulation. As detailed on line 385: the reported Fst is > 0.1, but the simulated Fst is essentially zero. This seems likely to be an undesirable simulation artefact, and potentially invalidates the simulation study (or, at least, doesn't provide evidence that EraSOR functions correctly when Fst is large, which was the ostensible motivation for this simulation). Is there no way to effectively simulate populations with a larger Fst?
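
The contrast the reviewer draws between a fully normal and a point-normal effect-size distribution can be sketched as follows. This is an illustrative simulation only, not the manuscript's expression (12) and not the LDpred or LassoSum code. It assumes standardized genotypes and independent SNPs, so that the sum of squared effects approximates the heritability; `simulate_effects`, `p_causal` and the chosen values are hypothetical.

```python
import numpy as np

def simulate_effects(n_snps, h2, p_causal, rng):
    """Draw per-SNP effects whose expected sum of squares equals h2.

    p_causal = 1.0 gives the dense normal model (every SNP is associated);
    p_causal < 1.0 gives a point-normal model (most SNPs have exactly zero effect).
    """
    causal = rng.random(n_snps) < p_causal
    # Scale the non-null component so the expected total variance stays at h2.
    slab_sd = np.sqrt(h2 / (n_snps * p_causal))
    return np.where(causal, rng.normal(0.0, slab_sd, n_snps), 0.0)

rng = np.random.default_rng(1)
dense = simulate_effects(100_000, 0.5, 1.0, rng)
sparse = simulate_effects(100_000, 0.5, 0.01, rng)

print("non-null SNPs, dense model: ", np.count_nonzero(dense))
print("non-null SNPs, sparse model:", np.count_nonzero(sparse))
print("sum of squared effects, dense: ", round(float((dense ** 2).sum()), 3))
print("sum of squared effects, sparse:", round(float((sparse ** 2).sum()), 3))
```

Both settings target the same heritability, but only the point-normal draw contains null SNPs, which is the property the reviewer's question about equation (3) depends on.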

    6.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad043), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Christopher C. Chang

      This paper addresses a significant need that has arisen in the interaction between privacy rules and ever-larger genomic datasets, and I find the results to be very promising and clearly worth publishing. I just have a few comments on some methodological details:

      line 130: Have you compared the effectiveness of this algorithm with plink2 --king-cutoff?

      lines 145-155: If I understand this correctly, these simulated quantitative traits are still normally distributed, they just aren't standardized to mean 0 variance 1. If the intent is to "simulate phenotypes that [do] not follow the standard normal distribution", I'd expect it to be more valuable to look at e.g. the log-normal case, where an alert user might transform the phenotype to normal, but some users may fail to do so. A mixture distribution may also be worth looking at.

      lines 238-239: Have you considered using the "cc-residualize" option of plink2 --glm, which removes most of the computational cost of including PCs in your binary trait analysis?

      lines 383-387: This is interesting; there is some room for follow-up investigation here. Thanks for posting all the scripts needed for another researcher to easily reproduce this Fst=0.00639 value; this could help facilitate development of a better genotype-simulation tool.

      Also, some minor copyedits:

      - line 84: "subpopulation" -> "subpopulations"
      - line 342: "overlaps" -> "overlap"
      - line 363: "ErasOR" -> "EraSOR"
      - line 376: "different level of environmental stratifications" -> "different levels of environmental stratification"
      - line 384: "population" -> "populations"
      - line 402: "capture" -> "captured"
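
The distinction raised for lines 145-155 (a normal trait with a different mean and variance is still normal, whereas a log-normal trait is not) can be checked with a short sketch. It is not taken from the manuscript's simulation scripts; the distributions, sample size and the helper `skewness` are illustrative assumptions.

```python
import numpy as np

def skewness(x):
    """Sample skewness (third standardized moment); close to 0 for a normal distribution."""
    centered = x - x.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)

rng = np.random.default_rng(2)
base = rng.standard_normal(50_000)

shifted_scaled = 10.0 + 3.0 * base   # non-standard mean and variance, but still normal
log_normal = np.exp(base)            # genuinely non-normal and right-skewed
recovered = np.log(log_normal)       # the transform an alert user would apply

for name, values in [
    ("shifted/scaled normal", shifted_scaled),
    ("log-normal", log_normal),
    ("log-transformed", recovered),
]:
    print(f"{name:>22}: skewness = {skewness(values):.2f}")
```

Shifting and rescaling leaves the shape unchanged, so only the log-normal trait probes robustness to a non-normal phenotype.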

    1. Editor’s Assessment

      Like other mollusc species, the freshwater pearl mussel (Margaritifera margaritifera) has a genome that is challenging to assemble owing to its large size, heterozygosity, and repetitive sequence. The first published M. margaritifera genome was highly fragmented, but here an improved reference genome assembly was generated using PacBio CLR long reads to reduce fragmentation levels, missing and truncated genes, and chimerically assembled regions. The number of gene models predicted is a bit higher than in other molluscan genomes, but after clarification and double checking these seem in line with some Mollusca and Bivalvia with similar and higher numbers of gene predictions. This new genome represents a new resource to start exploring the many biological, ecological, and evolutionary features of this threatened and commercially important group of organisms.

      This assessment refers to version 1 of this preprint.

    1. Editor’s Assessment

      Hybrid genomes are tricky to assemble, and few genomic resources are available for hybrid grapevines such as ‘Chambourcin’, a French-American interspecific hybrid grape grown in the eastern and midwestern United States. Here is an attempt to assemble ‘Chambourcin’ using a combination of PacBio HiFi long-reads, Bionano optical maps, and Illumina short-read sequencing technologies, producing an assembly with 26 scaffolds, an N50 length of 23.3 Mb and an estimated BUSCO completeness of 97.9% that can be used for genome comparisons, functional genomic analyses, and genome-assisted breeding research. Error correction and Pilon polishing were a challenge with this hybrid assembly, but after trying a few different approaches during the review process the authors have improved it, and as they have documented what they did and are clear about the final metrics, users can assess the quality themselves.

      This assessment refers to version 2 of this preprint.

    2. Background ‘Chambourcin’ is a French-American interspecific hybrid grape variety grown in the eastern and midwestern United States and used for making wine. Currently, there are few genomic resources available for hybrid grapevines like ‘Chambourcin’. Results We assembled the genome of ‘Chambourcin’ using PacBio HiFi long-read sequencing and Bionano optical map sequencing. We produced an assembly for ‘Chambourcin’ with 27 scaffolds with an N50 length of 23.3 Mb and an estimated BUSCO completeness of 98.2%. 33,265 gene models were predicted, of which 81% (26,886) were functionally annotated using Gene Ontology and KEGG pathway analysis. We identified 16,501 common orthologs between ‘Chambourcin’ gene models, V. vinifera ‘PN40024’ 12X.v2, VCOST.v3, V. riparia ‘Manitoba 37’ and V. riparia Gloire. A total of 1,589 plant transcription factors representing 58 different gene families were identified in ‘Chambourcin’. Finally, we identified 310,963 simple sequence repeats (SSRs), repeating units of 1–6 base pairs in length in the ‘Chambourcin’ genome assembly. Conclusions We present the genome assembly, genome annotation, protein sequences and coding sequences reported for ‘Chambourcin’. The ‘Chambourcin’ genome assembly provides a valuable resource for genome comparisons, functional genomic analysis, and genome-assisted breeding research.

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.84) and has published the reviews under the same license. These are as follows.

      **Reviewer 1. Lingfei Shangguan ** Reviewers Comments: Grapevine is one of the most important fruit crops in the world, and ‘Chambourcin’ is a hybrid wine grape variety that represents a cross between North American and European Vitis species. The authors have sequenced the genome of ‘Chambourcin’ and obtained the repeat sequences and gene annotation information. However, the sequencing depth was too low for the grape genome, especially given its high heterozygosity. They also did not apply Illumina sequencing for sequence correction.

      Re-review: Although the authors have made some corrections and improvements, the genome quality is still low, and the manuscript has not improved significantly. The authors should provide the haplotype sequences, and describe the genome assembly and correction steps more clearly. Moreover, the innovation of the article is insufficient. I suggest rejection.

      **Reviewer 2. Pablo Carbonell-Bejerano **

      Are all data available and do they match the descriptions in the paper? No. Access to the raw data for the RNA-seq dataset that was used for gene predictions is not indicated

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      No. Any description of the RNA-seq dataset and its origin or features is fully missing. I could not find other data that would be required according to guidelines in http://gigadb.org/site/guide: - Full (not summary) BUSCO results output files (text) - readme.txt including all file names with a brief description of each - sample metadata that complies with the Genomic Standards Consortium.

      Is the data acquisition clear, complete and methodologically sound?

      Yes. Sequencing and bioinformatic methods followed are generally sound.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. 1. Availability for the scripts used in bioinformatic analyses and data plotting is generally missing.

      2. L124. Authors describe that minimap2 was used to obtain the dotplot. However, minimap2 alone does not produce dotplots.

      3. L131. It is unclear how ‘PN40024’ 12X.v2, VCost.v3 protein annotations were used as input of BRAKER2. Do authors mean protein sequences instead? Where were these protein data retrieved from? How are proteins aligned to the assembly? Was BRAKER run from masked or unmasked assembly?

      Is there sufficient data validation and statistical analyses of data quality? No. 1. Validation of the original material for its true-to-typeness as 'Chambourcin' cultivar genotype is not mentioned, neither the number of different plants used for DNA extraction. While post-assembly validation of the Chambourcin genome assembly genotype from the mapped Chambourcin rhAmpSeq markers may be possible, such genotype validation is not mentioned either in the text.

      2. In general, the quality and the genome variation represented in the Chambourcin genome assembly produced here could have been further tested. For instance, the 2% BUSCO duplication and the 501.5 Mb primary assembly size, as compared to the 481.5 Mb haploid genome size that can be inferred from the k-mer analysis presented by the authors, indicate that further duplication purging of the primary assembly is likely needed. This issue could be addressed by looking for assembly regions with reduced alignment depth when all HiFi reads are mapped to the primary assembly. Duplicated regions to be purged could also be supported by co-linear assembly segments sharing BUSCO duplicated genes. For assembly reliability assessment, 10X, rhAmpSeq, or Illumina WGS data that is available for Chambourcin could also be used to validate genome variants represented in this Chambourcin assembly when comparing the inter-haplotype variants detected between primary and haplotig assemblies or the haplotypes with genome assemblies from other genotypes.
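
The size comparison behind this point can be worked through as a short sketch. The two sizes are the ones quoted in the review; the per-window depths, the 0.75 cutoff, and the function names are hypothetical illustrations, not values or tools from the manuscript.

```python
def excess_sequence(assembly_mb, haploid_mb):
    """Return (excess in Mb, excess as % of the haploid estimate)."""
    excess = assembly_mb - haploid_mb
    return excess, 100.0 * excess / haploid_mb


def flag_reduced_depth(window_depths, median_depth, max_ratio=0.75):
    """Return indices of windows whose read depth is well below the median.

    Windows that remain duplicated in a primary assembly tend to attract
    only about half of the reads, so they appear as reduced-depth regions.
    The max_ratio cutoff is an assumption made for this illustration.
    """
    return [i for i, depth in enumerate(window_depths)
            if depth <= max_ratio * median_depth]


primary_assembly_mb = 501.5  # primary assembly size quoted in the review
haploid_estimate_mb = 481.5  # k-mer-based haploid size quoted in the review

excess_mb, excess_pct = excess_sequence(primary_assembly_mb, haploid_estimate_mb)
print(f"excess sequence: {excess_mb:.1f} Mb ({excess_pct:.1f}% of haploid size)")

# Toy per-window HiFi depths (hypothetical); windows 2 and 3 sit near half depth.
toy_depths = [28, 27, 14, 13, 29]
print("candidate duplicated windows:", flag_reduced_depth(toy_depths, median_depth=28))
```

The roughly 20 Mb of excess sequence is the amount that further purging would be expected to remove if the k-mer estimate is accurate.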

      Is the validation suitable for this type of data? Yes. The validation is suitable, although it might not suffice in all cases.

      Is there sufficient information for others to reuse this dataset or integrate it with other data? No. As described before, there is missing information at several instances, like for the origin of the RNA-seq.

      Additional Comments: 1. L171. Is it correct that total length of Bionano maps was as small as 962,964 bp? Or do authors mean kb instead of bp in that sentence?

      2. Could the mapping of Chambourcin rhAmpSeq markers have been further exploited to phase contig haplotypes before purging haplotypes and assembly scaffolding?

      3. For the Conclusion in L254, it might be arguable whether the presented Chambourcin genome assembly is the first genome assembly of a complex interspecific hybrid or not. For instance 'Shine Muscat' might also be considered a complex inter-specific hybrid grape cultivar and its genome assembly was published: https://academic.oup.com/dnaresearch/article/29/6/dsac040/6808674 It might even be arguable whether the one presented in this publication is the first Chambourcin genome assembly as there is a 10X Genomics-based assembly available for Chambourcin: https://www.nature.com/articles/s41467-019-14280-1

      Re-review: Efforts to improve the accuracy of the MS and the availability of data are clear in the revised version. Authors have included descriptions of M&M procedures and information about the origin of several datasets that were missing. They also included files with commands and original results to the FTP server. In addition, they did further de-duplication of the assembly, added Illumina sequencing for assembly polishing, and included further QC stats and comparisons to another recently published hybrid grapevine genome assembly.

      Most revision actions were successful. However, it is not recommended to polish HiFi assemblies with Illumina reads, as in most cases it harms the consensus quality more than it improves it, which is particularly true for repetitive and highly heterozygous genomes like that of the Chambourcin grapevine cultivar. In fact, the BUSCO completeness of 97.9% after Pilon short-read polishing, compared to 98.2% in the former version, indicates that polishing with Illumina short reads is indeed harmful in this revised version. I agree with the authors that 28x depth of PacBio HiFi reads should suffice to produce a quality genome assembly without using more depth or other sequencing technologies, as they indicate in their response. I would recommend removing the Pilon polishing from the final assembly version, as it is only recommended for error-prone PacBio CLR or Nanopore assemblies. Instead, the authors could use the Illumina reads for k-mer analysis of assembly consensus quality and completeness.

      **Editorial Board Member adjudication: **

      Comment 1. How many times did you do the polishing with Pilon? This is not clear in the documents provided. It could be 1 round or many. Many would be a concern. When we run error correction on genomes, we monitor BUSCO and when it drops, roll back one iteration.

      Comment 2. How many sites were corrected in the polishing of the primary and haplotig assembly?

      Comment 3. Can you run KAT (“KAT: A K-Mer Analysis Toolkit to Quality Control NGS Datasets and Genome Assemblies.” Bioinformatics 33 (4): 574–76) to check the diploid, primary and haplotig assemblies?

      Comment 4. Can you align the mRNAseq and whole genome shotgun reads to diploid, primary and haplotig assemblies and report the percent mapping including the properly paired?
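
The "monitor BUSCO and roll back one iteration" rule in Comment 1 can be sketched as a small selection function. This is an illustration under assumptions: the function name is hypothetical, BUSCO completeness is assumed to have been measured separately after each round, and the only real values used are the 98.2% and 97.9% figures quoted in the reviews.

```python
def select_polishing_round(busco_by_round):
    """Return the index of the polishing round to keep.

    busco_by_round[0] is the BUSCO completeness (%) of the unpolished
    assembly and busco_by_round[i] is the value after polishing round i.
    The round kept is the last one before completeness first drops.
    """
    keep = 0
    for i in range(1, len(busco_by_round)):
        if busco_by_round[i] < busco_by_round[i - 1]:
            break  # completeness dropped: roll back to the previous round
        keep = i
    return keep


# Values quoted in the reviews: 98.2% before and 97.9% after Pilon polishing.
print(select_polishing_round([98.2, 97.9]))  # keeps round 0, the unpolished assembly
```

With the figures reported for this assembly, the rule would keep the unpolished HiFi assembly, which matches Reviewer 2's recommendation.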

  14. Jun 2023
    1. Tissue clearing is currently revolutionizing neuroanatomy by enabling organ-level imaging with cellular resolution. However, currently available tools for data analysis require a significant time investment for training and adaptation to each laboratory’s use case, which limits productivity. Here, we present FriendlyClearMap, an integrated toolset that makes ClearMap1 and ClearMap2’s CellMap pipeline easier to use, extends its functions, and provides Docker Images from which it can be run with minimal time investment. We also provide detailed tutorials for each step of the pipeline. For more precise alignment, we add a landmark-based atlas registration to ClearMap’s functions as well as include young mouse reference atlases for developmental studies. We provide alternative cell segmentation methods besides ClearMap’s threshold-based approach: Ilastik’s Pixel Classification, importing segmentations from commercial image analysis packages and even manual annotations. Finally, we integrate BrainRender, a recently released visualization tool for advanced 3D visualization of the annotated cells. As a proof-of-principle, we use FriendlyClearMap to quantify the distribution of the three main GABAergic interneuron subclasses (Parvalbumin+, Somatostatin+, and VIP+) in the mouse fore- and midbrain. For PV+ neurons, we provide an additional dataset with adolescent vs. adult PV+ neuron density, showcasing the use for developmental studies. When combined with the analysis pipeline outlined above, our toolkit improves on the state-of-the-art packages by extending their function and making them easier to deploy at scale.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad035 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      **Reviewer Yimin Wang **

      This work (FriendlyClearMap) attempts to combine several tools such as ClearMap 1/2, BrainRender, etc., and integrate certain functions into a Docker image for ease of use. The authors then demonstrated the use of FriendlyClearMap by analysing PV+, SST+ and VIP+ neurons. Some detailed comments are below:

      1/ P4, second paragraph, line 3, "vs." -> "versus".

      2/ P9, third paragraph, line 8, conflict between "lastly" and "finally"

      3/ P9, third paragraph, line 8, "our tool allows …".

      4/ This work can be regarded as a reengineering effort based on several previous toolkits in order to facilitate the workflow of registration, segmentation, analysis, and visualization. Essentially, no new technology is involved in this work and no new application is enabled by FriendlyClearMap. Therefore, in order to emphasize the unique contribution of this work, the authors could elaborate on how this tool makes biologists' work easier.

      5/ The results for Figure 2g are somewhat trivial. The authors might consider replacing it with a more impressive analysis.

      6/ The majority of the results are related to cell segmentation and counting. Quantitative plots/tables could be provided for more information. In addition, the accuracy of the results could also be discussed.

      7/ Last but not least, as there is no substantial novelty in the software, the authors could consider changing the focus of the manuscript from a tool paper to a resource/results paper, emphasizing new biological findings obtained by using FriendlyClearMap.

    2. Tissue clearing is currently revolutionizing neuroanatomy by enabling organ-level imaging with cellular resolution. However, currently available tools for data analysis require a significant time investment for training and adaptation to each laboratory’s use case, which limits productivity. Here, we present FriendlyClearMap, an integrated toolset that makes ClearMap1 and ClearMap2’s CellMap pipeline easier to use, extends its functions, and provides Docker Images from which it can be run with minimal time investment. We also provide detailed tutorials for each step of the pipeline. For more precise alignment, we add a landmark-based atlas registration to ClearMap’s functions as well as include young mouse reference atlases for developmental studies. We provide alternative cell segmentation methods besides ClearMap’s threshold-based approach: Ilastik’s Pixel Classification, importing segmentations from commercial image analysis packages and even manual annotations. Finally, we integrate BrainRender, a recently released visualization tool for advanced 3D visualization of the annotated cells. As a proof-of-principle, we use FriendlyClearMap to quantify the distribution of the three main GABAergic interneuron subclasses (Parvalbumin+, Somatostatin+, and VIP+) in the mouse fore- and midbrain. For PV+ neurons, we provide an additional dataset with adolescent vs. adult PV+ neuron density, showcasing the use for developmental studies. When combined with the analysis pipeline outlined above, our toolkit improves on the state-of-the-art packages by extending their function and making them easier to deploy at scale.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad035 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer Chris Armit

      This Technical Note paper describes "FriendlyClearMap: An optimized toolkit for mouse brain mapping and analysis".

      Whereas the core concept of a data analysis tool to assist in spatial mapping of cleared mouse tissues is perfectly reasonable, there are multiple issues with the documentation that renders this toolkit very difficult to use. I detail below some of the issues I have encountered.

      1. GitHub repository The installation instructions are missing from the following GitHub repository: https://github.com/MoritzNegwer/FriendlyClearMap-scripts The closest reference I could find to installation instructions is the following: "Please see the Appendices 1-3 of our <X_upcoming> publication for detailed instructions on how to use the pipelines. <X_protocols.io goes here>" Step-by-step installation instructions should be included in the GitHub repository. In addition, the authors should add the protocols.io links to their GitHub repository.

      2. Protocols.io The installation instructions are missing from the following protocols.io links: Run Clearmap 1 docker (dx.doi.org/10.17504/protocols.io.eq2lynnkrvx9/v1) and Run Clearmap 2 docker (dx.doi.org/10.17504/protocols.io.yxmvmn9pbg3p/v1). Both of these protocols include the following instruction: "Then, download the docker container from our repository: XXX docker container goes here" In the documentation, the authors need to unambiguously refer to the specific Docker container that a user needs to install for each software tool.

      3. Test Data I could not find the test data in the form of image stacks that would be needed to test the FriendlyClearMap protocols. Figure 1 refers to 16-bit TIFF image stacks, and I presume these to be the input data that is needed for the image analysis pipelines described in the manuscript. The authors should provide details of the test imaging dataset, including links if necessary to where the image stacks data can be downloaded, in the 'Data Availability' section of the manuscript.

      4. Platform / Operating Systems In the 'Data Availability' section of the manuscript, the authors specify that the Operating Systems are "platform-independent". However, the protocols.io documents list a set of requirements for Windows and LINUX, but not for MacOS. The authors should provide installation instructions and system requirements for MacOS.

      I reject this manuscript on the grounds that, due to lack of appropriate documentation and installation instructions, the software tool is too difficult to use and therefore has extremely low reuse potential.

    1. Background Reproducibility of data analysis workflow is a key issue in the field of bioinformatics. Recent computing technologies, such as virtualization, have made it possible to reproduce workflow execution with ease. However, the reproducibility of results is not well discussed; that is, there is no standard way to verify whether the biological interpretation of reproduced results is the same. Therefore, it still remains a challenge to automatically evaluate the reproducibility of results. Results We propose a new metric, a reproducibility scale of workflow execution results, to evaluate the reproducibility of results. This metric is based on the idea of evaluating the reproducibility of results using biological feature values (e.g., number of reads, mapping rate, and variant frequency) representing their biological interpretation. We also implemented a prototype system that automatically evaluates the reproducibility of results using the proposed metric. To demonstrate our approach, we conducted an experiment using workflows used by researchers in real research projects and the use cases that are frequently encountered in the field of bioinformatics. Conclusions Our approach enables automatic evaluation of the reproducibility of results using a fine-grained scale. By introducing our approach, it is possible to evolve from a binary view of whether the results are superficially identical or not to a more graduated view. We believe that our approach will contribute to more informed discussion on reproducibility in bioinformatics.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad031 ) , which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      **Reviewer Stian Soiland-Reyes ** Hi, I am Stian Soiland-Reyes https://orcid.org/0000-0001-9842-9718 and have pledged the Open Peer Review Oath https://doi.org/10.12688/f1000research.5686.2:

      Principle 1: I will sign my name to my review

      Principle 2: I will review with integrity

      Principle 3: I will treat the review as a discourse with you; in particular, I will provide constructive criticism

      Principle 4: I will be an ambassador for the practice of open science.

      This review is licensed under a Creative Commons Attribution 4.0 International License.

      --- This article presents a method for comparing reproducibility of computational workflow runs captured as RO-Crates, by calculating a set of genomics metrics ("features") and adding these to the crate's metadata. Overall I find this a valuable contribution and worthy of publication with GigaScience, primarily as a way for users of workflow systems CWL, Nextflow, Cromwell or Snakemake to ensure reproducibility, but also for workflow engine developers who may want to build on this methodology to improve their provenance support. In general the method proposed is sound; however, it does have some limitations and inherent assumptions that are not highlighted sufficiently in the current manuscript, particularly concerning the selection of features and the reproducibility of the metrics calculation itself. I have detailed this with some points below that I would like the authors to clarify in a minor revision.

      --- Note - the below questions from GigaScience Reviewer Guidelines mainly relate to data, but I also here interpret them for the software described.

      Q1: Is the rationale for collecting and analyzing the data well defined? The authors' workflow executions https://doi.org/10.5281/zenodo.7098337 are based on three third-party bioinformatics workflows. Although they are not particularly "large-scale", they are representative best-practice pipelines in this field (data sizes from 200 MB to 6 GB) and also fairly representative for scalable workflow systems (Nextflow, CWL and WDL) used by bioinformaticians.

      Q2: Is it clear how data was collected and curated? It is not explicit in the text why these particular workflows were selected, beyond being realistic pipelines used in research. I would suggest something like "these workflows have been selected as fairly representative and mature current best-practice for sequencing pipelines, implemented in different but typical workflow systems, and have similar set of genomics features that we can assess for provenance comparison." The workflows have each been cited, but I would appreciate some consistency so that each workflow is cited both by its closest journal article and as their original download sources (e.g. GitHub).

      Q3: Is it clear - and was a statement provided - on how data and analyses tools used in the study can be accessed? Yes, full availability statements have been provided both for data and software, archived on Zenodo for longevity.

      Q4: Are accession numbers given or links provided for data that, as a standard, should be submitted to a community approved public repository? Yes, the tools have been added to https://bio.tools/ -- I don't think it's necessary to further register the data outputs with accession numbers. RRIDs for tools can be considered at a later stage, perhaps only for Sapporo.

      Q5: Is the data and software available in the public domain under a Creative Commons license? Yes, the software and dataset is open source under Apache License, version 2.0. The dataset https://doi.org/10.5281/zenodo.7098337 embeds existing workflows and data, however this is OK as included resources such as the rnaseq Nextflow workflow have compatible licenses (MIT) or are also Apache-licensed. The manuscript has software citations for two of the workflows, but this is missing for the CWL workflow, which is only cited by manuscript (33) (also missing DOI). It is unclear if any of the workflows are registered in https://workflowhub.eu/ but that should primarily be done by their upstream authors. The RO-Crates in https://doi.org/10.5281/zenodo.7098337 don't include any licensing and attribution for the embedded workflows, and its metadata file is misleadingly declaring the crate license as CC0 public domain. While CC0 is appropriate for examples and metadata file itself, the embedded MIT/Apache workflows from third parties can't legally be relicensed in this way and should have their original licenses declared. See https://www.researchobject.org/ro-crate/1.1/contextualentities.html#licensing-access-control-and-copyright I understand these RO-Crates are generated automatically by Sapporo, which does not directly understand licensing, and for documenting the test runs with Sapporo, I think these should not be modified post-execution. Pending further license support by Sapporo, perhaps a manual outer RO-Crate that aggregate these (e.g. adding a direct top-level ro-crate-metadata.json to the Zenodo entry) can provide more correct metadata as well as workflow citations. The authors could add to Discussion some consideration on (lack of) propagation of such metadata for auto-generated crates as part of workflow run provenance. 
For instance, if a workflow run was initiated from a Workflow Crate https://w3id.org/workflowhub/workflow-ro-crate/ at WorkflowHub, its license, attributions and descriptions could be carried forward to the final Workflow Run Crate provenance together with the Sapporo-calculated features.
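
The manual outer RO-Crate suggested under Q5 could look roughly like the following sketch, which builds a top-level ro-crate-metadata.json declaring a license per embedded archive. It is an illustration under assumptions: the function name is hypothetical, only the MIT license of the rnaseq workflow is stated in this review (the Apache-2.0 entry for trimming.zip is a placeholder), and a complete crate would also need a name, description, publication date and license on the root dataset, plus workflow citations.

```python
import json


def build_outer_crate(part_licenses):
    """Build minimal RO-Crate 1.1 metadata for an aggregating outer crate.

    part_licenses maps each embedded archive name to a license URL.
    """
    graph = [
        {   # metadata file descriptor required by RO-Crate 1.1
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # root dataset aggregating the Sapporo-generated crates
            "@id": "./",
            "@type": "Dataset",
            "hasPart": [{"@id": name} for name in part_licenses],
        },
    ]
    for name, license_url in part_licenses.items():
        graph.append({"@id": name, "@type": "File",
                      "license": {"@id": license_url}})
    for license_url in sorted(set(part_licenses.values())):
        # contextual entity describing each distinct license
        graph.append({"@id": license_url, "@type": "CreativeWork"})
    return {"@context": "https://w3id.org/ro/crate/1.1/context",
            "@graph": graph}


crate = build_outer_crate({
    "rnaseq_1st.zip": "https://spdx.org/licenses/MIT",       # MIT per the review
    "trimming.zip": "https://spdx.org/licenses/Apache-2.0",  # placeholder assumption
})
print(json.dumps(crate, indent=2))
```

Keeping the licensing in an outer crate leaves the auto-generated inner crates untouched, which is the constraint the review asks for.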

      Q6: Are the data sound and well controlled? Yes, the data is sound. The testing on Mac gives null-results, but the authors explain the workflows failed to execute there due to architectural differences, which is flagged as a valid concern for reproducibility. It may be worth further investigating if this is due to misconfiguration on that particular test machine, in which case these columns should be removed.

      Q7: Is the interpretation (Analysis and Discussion) well balanced and supported by the data? The authors' discussion has some implicit assumptions that should be made more clear, together with their implications. The Tonkaz tool assumes the workflow execution has already extracted the features and added them to the RO-Crate. This assumes the right features have been correctly extracted by each execution. Feature extraction also depends on bioinformatics tools that are subject to change/updates. Newer versions of Sapporo-service, and in particular any non-Sapporo executors also making Workflow run Crates, may have a different feature selection. Being able to fairly compare two workflow runs therefore depends on careful control of the Sapporo executor versions so that they have consistent feature selection. This means the reproducibility metric proposed has a potential reproducibility challenge itself. This is not to say that the approach is bad, as the feature extraction is using predictable measures such as counting sequences, rather than heuristics. This means Future Work should point out the need for guidelines on what kind of features should be selected, to ensure they are consistent and reproducible. The set of features also depends on the type of data and class of analysis. As a minimum, the RO-Crate should therefore include provenance of that feature extraction, noting the Sapporo version, and ideally the version of the tools used for that. The authors may want to consider if feature extraction should be a separate workflow (e.g. in CWL), that itself can be subject to the same reproducibility preservation measures, and therefore also can be performed post-execution as part of Tonkaz' comparison or as a curation activity when storing Workflow Run Crates.

      Q8: Are the methods appropriate, well described, and include sufficient details and supporting information to allow others to evaluate and replicate the work? Yes, it was very easy to replicate the Tonkaz analysis of the workflow run crate that is already provided, as it is provided also as a Docker container. The Docker container is provided as part of GitHub releases, and so is not at risk of Docker Hub's automatic deletion. I have not tried installing my own Sapporo service to re-execute the workflow, but detailed installation and run details are provided in the README of both Tonkaz https://github.com/sapporowes/tonkaz#readme and sapporo-service https://github.com/sapporowes/sapporo/blob/main/docs/GettingStarted.md

      Q9: What are the strengths and weaknesses of the methods? The method provided is strong compared to naive checksum-based comparison of workflow outputs, which has been pointed out as a challenge by previous work. The advantage of the feature extraction is that the statistics can be compared directly and any discrepancies can be displayed to the user at a digestible high level. The disadvantage is that this depends wholly on the selection of features, which must be done carefully to cover the purpose of the particular workflow and its type of data. For instance, a workflow that generates diagrams of sequence alignments could not be sufficiently tested in the suggested approach, as analyzing the diagram for correctness would require tools that may not even exist. Perhaps feature extraction should be a part of the workflow itself, so it can self-determine what is important for its analysis? The current approach also is quite sensitive to output data filenames, so changes in filename would mean features are not compared, even where such files are equivalent. This should be made more explicit in the manuscript, for instance workflows should ensure they don't include timestamps or random identifiers in their filenames. Further work could have a deeper understanding of the workflow structure to compare outputs based on their corresponding FormalParameter in the RO-Crate.
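
The contrast drawn under Q9 between checksum comparison and feature comparison, and the filename sensitivity, can be illustrated with a small sketch. This is not Tonkaz's implementation: the function names, the 1% tolerance, the file names and the feature values are all hypothetical (only the feature name "totalReads" appears in this review).

```python
import hashlib
import math


def sha256_of(data):
    """Checksum of an output file's bytes (the naive comparison)."""
    return hashlib.sha256(data).hexdigest()


def compare_features(run_a, run_b, rel_tol=0.01):
    """Compare two runs given as {filename: {feature_name: value}}.

    Files are matched by name only, so a renamed but equivalent output
    is reported as "unmatched" rather than being compared.
    """
    report = {}
    for name in sorted(set(run_a) | set(run_b)):
        if name not in run_a or name not in run_b:
            report[name] = "unmatched"
            continue
        feats_a, feats_b = run_a[name], run_b[name]
        same = feats_a.keys() == feats_b.keys() and all(
            math.isclose(feats_a[k], feats_b[k], rel_tol=rel_tol)
            for k in feats_a
        )
        report[name] = "similar" if same else "different"
    return report


# Two toy outputs that differ only in an embedded timestamp header.
out_a = b"# generated 2022-09-01\nread1\nread2\n"
out_b = b"# generated 2022-09-02\nread1\nread2\n"
print("checksums equal:", sha256_of(out_a) == sha256_of(out_b))

run_a = {"sample.bam": {"totalReads": 1000, "mappingRate": 0.953},
         "report_20220901.txt": {"totalReads": 1000}}
run_b = {"sample.bam": {"totalReads": 1000, "mappingRate": 0.951},
         "report_20220902.txt": {"totalReads": 1000}}
print(compare_features(run_a, run_b))
```

The timestamped report files carry identical features but are never compared because their names differ, which is the limitation the review asks the authors to make explicit.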

      Q10: Have the authors followed best-practices in reporting standards? Yes, the details provided are at a sufficient detail level, and the authors have re-used the RO-Crate data packaging. The RO-Crates created by Sapporo-service add several terms for the metrics, which are declared on the @context according to RO-Crate specs https://www.researchobject.org/rocrate/1.1/appendix/jsonld.html#extending-ro-crate However the terms point to GitHub "raw" pages, which are not particularly stable, and may change depending on sapporo versions and GitHub's repository behaviour. I recommend changing the ad-hoc terms to PIDs such as a namespace under https://w3id.org/ or https://purl.org/ so that these terms can be stable semantic artefacts, e.g. submitting them to https://github.com/ResearchObject/ro-terms to register https://w3id.org/ro/terms/sapporo#WorkflowAttachment that can be used instead of https://raw.githubusercontent.com/sapporo-wes/sapporo-service/main/sapporo/roterms.csv#WorkflowAttachment or alternatively https://w3id.org/sapporo#WorkflowAttachment could be set up to redirect to the ro-terms.csv on GitHub. (discussed with the authors at ELIXIR Biohackathon) In doing so you should separate into two namespaces, the general Sapporo terms like "sha512", and the particular genomics feature sets including "totalReads" (e.g. https://w3id.org/datafeatures/genomics#WorkflowAttachment) as the second are a) not Sapporo-specific and b) domain-specific. RO-Crate is developing Workflow Run profiles https://www.researchobject.org/workflow-runcrate/profiles/, although these have not been released at time of my review they are now stable, so the authors may want to check https://www.researchobject.org/workflow-runcrate/profiles/workflow_run_crate to ensure "FormalParameter" are declared correctly in the generated RO-Crate as separate entities, linked from the "File" using "exampleOfWork".

      Q11: Can the writing, organization, tables and figures be improved? The language and readability of this article is generally very good. Light copy-editing may improve some of the sentences, e.g. reducing the use of "Thus" phrases.

      Q12: When revisions are requested. See suggestions from above for minor revisions:

      • Make explicit why these 3 workflows were selected (see Q2)

      • Make pipeline software citations consistent in manuscript (see Q2, Q5)

      • Avoid declaring CC0 within generated RO-Crate -- move this to only apply to the ro-crate-metadata.json

      • Add an outer RO-Crate metadata file to Zenodo deposit to carry the correct licenses and pipeline licenses for each of rnaseq_1st.zip, trimming.zip etc.

      • Improve discussion to better reflect limitations of the features and its own reproducibility issues (see Q7, Q9)

      • Consider improvements to the RO-Crate context (see Q10) - this may just be noted as Future Work in the manuscript rather than regenerating the crates

      In addition:

      • p2: Add citation for claim on file checksums different depending on software versions etc., for instance https://doi.org/10.1145/3186266

      • p3. "We converted Sapporo's provenance into RO-Crate" -- re-cite (20) as this is the paragraph explaining what it is.

      • p10. Citations 7, 8 are missing authors

      • p10. Citation 15 is now published, replace with https://doi.org/10.1145/3486897

      • p0. Citations 28, 33 are missing DOIs

      Q13: Are there any ethical or competing interests issues you would like to raise? No, the third-party pipelines selected for reproducibility testing are already published and are here represented fairly, and only used as executable methods (as intended by their original authors), which I would say do not need ethical approval.

    2. Background Reproducibility of data analysis workflow is a key issue in the field of bioinformatics. Recent computing technologies, such as virtualization, have made it possible to reproduce workflow execution with ease. However, the reproducibility of results is not well discussed; that is, there is no standard way to verify whether the biological interpretation of reproduced results are the same. Therefore, it still remains a challenge to automatically evaluate the reproducibility of results. Results We propose a new metric, a reproducibility scale of workflow execution results, to evaluate the reproducibility of results. This metric is based on the idea of evaluating the reproducibility of results using biological feature values (e.g., number of reads, mapping rate, and variant frequency) representing their biological interpretation. We also implemented a prototype system that automatically evaluates the reproducibility of results using the proposed metric. To demonstrate our approach, we conducted an experiment using workflows used by researchers in real research projects and the use cases that are frequently encountered in the field of bioinformatics. Conclusions Our approach enables automatic evaluation of the reproducibility of results using a fine-grained scale. By introducing our approach, it is possible to evolve from a binary view of whether the results are superficially identical or not to a more graduated view. We believe that our approach will contribute to more informed discussion on reproducibility in bioinformatics.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad031), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer Stephen R Piccolo:

      This manuscript describes a methodology for automating evaluation of the reproducibility of data science workflows for genomics analyses. The authors explain that reproducibility should be evaluated on a scale rather than on a binary basis. They explain concepts related to these issues and apply their methodology to real-world data. The manuscript was well written and addresses an important issue. I believe this manuscript provides new insights. I have a few minor concerns that I would appreciate being addressed:

      • The manuscript indicates that it's not feasible to compare images automatically. However, this is actually pretty easy. For example, using the Pillow package in Python, you can calculate a percentage similarity between two image files. I'm not suggesting that the authors should do this in their study. But the text should not preclude this as a possibility.

      • The authors describe scenarios where the outputs might be different but these differences would be immaterial to the overall conclusions. They also describe a few scenarios where the outputs differ for biological features but that the differences are relatively small and could be considered to be acceptable. Examples include when BAM files are sorted differently. I think it would be helpful to add a bit more discussion of scenarios where differences in biological features could occur and what would cause those differences.

      • Although a person checking the outputs can change the numeric threshold, it would be difficult to know what that threshold should be. Perhaps the authors could describe additional situation(s) where having relatively large differences would be acceptable and other situation(s) where they would not. For example, you could have a single difference in the biological feature outputs and perhaps that would make a huge difference in the interpretation in some cases. Additional discussion would be helpful.

      • This paper focuses on automating the verification process. I think the big picture could be explained more. Who might perform this verification process in a scientific context? In what context would they do it?

      • Please add brief discussion about generalizing this methodology beyond Tonkaz.
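
      The role of the numeric threshold raised in the comments above can be made concrete with a small sketch. It assumes a relative-difference check per feature; this is an illustrative scheme, not the comparison logic of the reviewed tool, and the feature names and the default threshold of 0.05 are hypothetical.

```python
def relative_difference(a: float, b: float) -> float:
    """Relative difference between two values, in the range 0..1 for same-sign inputs."""
    if a == b:
        return 0.0
    return abs(a - b) / max(abs(a), abs(b))


def compare_features(original, reproduced, threshold=0.05):
    """Label each feature as 'same', 'within threshold', 'different' or 'missing'."""
    report = {}
    for name in sorted(set(original) | set(reproduced)):
        if name not in original or name not in reproduced:
            report[name] = "missing"
            continue
        diff = relative_difference(original[name], reproduced[name])
        if diff == 0:
            report[name] = "same"
        elif diff <= threshold:
            report[name] = "within threshold"
        else:
            report[name] = "different"
    return report


if __name__ == "__main__":
    # Hypothetical feature values from an original and a reproduced run.
    original = {"totalReads": 1_000_000, "mappingRate": 0.95, "variantCount": 1200}
    reproduced = {"totalReads": 1_000_000, "mappingRate": 0.94, "variantCount": 1500}
    for feature, label in compare_features(original, reproduced).items():
        print(feature, label)
```

      A single global threshold treats all features alike, which is exactly the difficulty the reviewer points out: a small change in one feature may matter more for interpretation than a large change in another, so per-feature thresholds may be a reasonable extension.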

    1. Background Integration of data from multiple domains can greatly enhance the quality and applicability of knowledge generated in analysis workflows. However, working with health data is challenging, requiring careful preparation in order to support meaningful interpretation and robust results. Ontologies encapsulate relationships between variables that can enrich the semantic content of health datasets to enhance interpretability and inform downstream analyses. Findings We developed an R package for electronic Health Data preparation ‘eHDPrep’, demonstrated upon a multi-modal colorectal cancer dataset (n=661 patients, n=155 variables; Colo-661). eHDPrep offers user-friendly methods for quality control, including internal consistency checking and redundancy removal with information-theoretic variable merging. Semantic enrichment functionality is provided, enabling generation of new informative ‘meta-variables’ according to ontological common ancestry between variables, demonstrated with SNOMED CT and the Gene Ontology in the current study. eHDPrep also facilitates numerical encoding, variable extraction from free-text, completeness analysis and user review of modifications to the dataset. Conclusion eHDPrep provides effective tools to assess and enhance data quality, laying the foundation for robust performance and interpretability in downstream analyses. Application to a multi-modal colorectal cancer dataset resulted in improved data quality, structuring, and robust encoding, as well as enhanced semantic information. We make eHDPrep available as an R package from CRAN [[URL will go here]].

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad030), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer Janna Hastings

      The manuscript describes a toolkit for the automated semantic enrichment and quality control of electronic health data using ontologies. This is a much needed utility that will add value to electronic data sharing and re-use for many different purposes including the development of machine learning for medical applications and personalised medicine. Overall the manuscript is well written and the functionality offered by the toolkit is well thought out and motivated. The internal consistency checks and the use of ontology-based information content to semantically aggregate variables into more informative meta-variables are particularly welcome functions.

      However, I recommend that the description of the tool functionality be clarified in some points, and the evaluation could be strengthened.

      Page 6-7, internal consistency:

      1. How should the user specify semantic dependencies between variable pairs? Would it not be helpful to use a standard format for this specification to enable interoperability and re-use of such specifications?

      2. Should the specification of semantic relationships between variables not be linked to the knowledge from the ontologies? Ontologies are able to represent many different types of logical relationships between classes, which make them ideal for then serving as a standard and interoperable format for specifying this type of constraint. Rules are another promising standard approach for logic-based knowledge representation.

      Page 11, figure 4 a: I think it would be informative for evaluating the operation of the tool if the heatmap of variable missingness after application of the tool could also be illustrated beside the current Fig 4a.

      Page 13, ontology preparation: The paragraph describes what the authors have done to prepare ontologies for use with the tool. Is this preparation procedure also necessary for users to follow when they use the eHDPrep tool? How can alternative ontologies be incorporated (which may be useful for other domains)?

      Evaluation: The biggest shortcoming of the presented manuscript is that the evaluation is limited to the application of the tool to one dataset and subsequent manual evaluation of the outcome by one group, the study authors.

      The results as presented are positive, but there is a significant risk that the tool performs well on this task, as assessed by these study authors, but then fails to generalise to other tasks and datasets that future users might wish to use it with. To mitigate this risk, it would be optimal if somewhat more independent methods could be found for evaluating the performance of the different aspects of the tool. One approach could be a rigorous comparison of this tool's performance against the performance of other tools that have similar functionality, e.g. comparison of the semantic aggregation function with other tools that find and recommend MICAs. An alternative approach might be to apply the tool to an additional dataset for which a group outside of the study authors would be prepared to provide an independent evaluation.
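
      For readers unfamiliar with the term, a most informative common ancestor (MICA) can be sketched on a toy hierarchy as follows. The terms are hypothetical and not SNOMED CT content, and information content is computed here from descendant counts, which is one common choice and an assumption of this sketch; the reviewed package may compute it differently.

```python
import math

# Toy is-a hierarchy (child -> parents); hypothetical terms for illustration only.
PARENTS = {
    "finding": [],
    "tumour finding": ["finding"],
    "colon tumour": ["tumour finding"],
    "rectal tumour": ["tumour finding"],
    "blood finding": ["finding"],
    "anaemia": ["blood finding"],
}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = -log(fraction of terms that are t or a descendant of t)."""
    counts = {term: 0 for term in PARENTS}
    for term in PARENTS:
        for anc in ancestors(term):
            counts[anc] += 1
    total = len(PARENTS)
    return {term: -math.log(counts[term] / total) for term in PARENTS}


def mica(term_a, term_b):
    """Return the shared ancestor with the highest information content."""
    ic = information_content()
    common = ancestors(term_a) & ancestors(term_b)
    # sorted() makes the choice deterministic if two ancestors tie on IC.
    return max(sorted(common), key=lambda term: ic[term])


if __name__ == "__main__":
    print(mica("colon tumour", "rectal tumour"))
    print(mica("colon tumour", "anaemia"))
```

      A comparison against other tools, as suggested above, would amount to checking whether they return the same ancestor for the same variable pairs on a shared ontology.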

    2. Background Integration of data from multiple domains can greatly enhance the quality and applicability of knowledge generated in analysis workflows. However, working with health data is challenging, requiring careful preparation in order to support meaningful interpretation and robust results. Ontologies encapsulate relationships between variables that can enrich the semantic content of health datasets to enhance interpretability and inform downstream analyses. Findings We developed an R package for electronic Health Data preparation ‘eHDPrep’, demonstrated upon a multi-modal colorectal cancer dataset (n=661 patients, n=155 variables; Colo-661). eHDPrep offers user-friendly methods for quality control, including internal consistency checking and redundancy removal with information-theoretic variable merging. Semantic enrichment functionality is provided, enabling generation of new informative ‘meta-variables’ according to ontological common ancestry between variables, demonstrated with SNOMED CT and the Gene Ontology in the current study. eHDPrep also facilitates numerical encoding, variable extraction from free-text, completeness analysis and user review of modifications to the dataset. Conclusion eHDPrep provides effective tools to assess and enhance data quality, laying the foundation for robust performance and interpretability in downstream analyses. Application to a multi-modal colorectal cancer dataset resulted in improved data quality, structuring, and robust encoding, as well as enhanced semantic information. We make eHDPrep available as an R package from CRAN [[URL will go here]].

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad030), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer Hugo Leroux

      This well-written paper describes techniques for semantically-enriching clinical data pertaining to colorectal cancer diagnosis. It describes an R-based tool, eHDPrep, to extract the data, which is subsequently cleaned, actioned for missing and erroneous values, encoded and enriched semantically using SNOMED CT and the GO, and ultimately exported after having undergone some QC. The paper is well-written and the methods really well-explained, for which the authors should be commended. I only have a few comments for the authors:

      1. It is not clear to me how, in the discussion on page 14, the authors have dealt with the issue of representing negative findings and missing values, as described within their enrichment outcomes section.

      2. In the "Ontology Preparation" section, the authors describe how they have taken the SNOMED CT terminology and performed some transformations to OWL and conversion to CSV format before mapping the Colo-661 variables to it. They don't, however, discuss the challenges that such an approach entails. The authors might consider perusing this article (https://doi.org/10.1186/s13326-018-0191-z), which addresses many of the challenges relating to ontology matching.

      3. Please insert an additional ")" when stating the "Equations", e.g. page 6: "... zero entropy [27] (Equation (1)) ...", also page 13.

    1. Background Eukaryotic gene expression is controlled by cis-regulatory elements (CREs), including promoters and enhancers, which are bound by transcription factors (TFs). Differential expression of TFs and their binding affinity at putative CREs determine tissue- and developmental-specific transcriptional activity. Consolidating genomic data sets can offer further insights into the accessibility of CREs, TF activity, and, thus, gene regulation. However, the integration and analysis of multi-modal data sets are hampered by considerable technical challenges. While methods for highlighting differential TF activity from combined chromatin state data (e.g., ChIP-seq, ATAC-seq, or DNase-seq) and RNA-seq data exist, they do not offer convenient usability, have limited support for large-scale data processing, and provide only minimal functionality for visually interpreting results. Results We developed TF-Prioritizer, an automated pipeline that prioritizes condition-specific TFs from multi-modal data and generates an interactive web report. We demonstrated its potential by identifying known TFs along with their target genes, as well as previously unreported TFs active in lactating mouse mammary glands. Additionally, we studied a variety of ENCODE data sets for cell lines K562 and MCF-7, including twelve histone modification ChIP-seq as well as ATAC-seq and DNase-seq datasets, where we observe and discuss assay-specific differences. Conclusion TF-Prioritizer accepts ATAC-seq, DNase-seq, or ChIP-seq and RNA-seq data as input and identifies TFs with differential activity, thus offering an understanding of genome-wide gene regulation, potential pathogenesis, and therapeutic targets in biomedical research.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad026), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer Kaixuan Luo

      This paper develops a novel pipeline, TF-Prioritizer, to prioritize condition-specific TFs through integrative analysis of histone modification (HM) ChIP-seq and RNA-seq data. The pipeline integrates multiple computational tools: it calculates TF binding site affinities and links candidate binding sites to genes using TRAP and TEPIC. It uses DYNAMITE, a sparse logistic regression classifier, to infer TFs related to differential gene expression between conditions. It computes an aggregated score "TF-TG score" to score TFs from multiple types of evidence, and obtains a prioritized list of TFs from all histone modifications using a discounted cumulative gain ranking approach. It also provides additional functionality and a web interface to visualize the results.

      Overall, the pipeline could be very useful for biologists with a user-friendly web application to automate the entire process from data preprocessing to statistical analysis and obtain interactive reports to gain novel biological insights. However, more systematic evaluations are needed to demonstrate the benefits of this pipeline.

      Major comments:

      1. In the computation of an aggregated score "TF-TG score", it uses a multiplicative function to combine differential expression (absolute log2FC), TF-Gene scores computed from TEPIC, and the total coefficients computed from DYNAMITE. One concern about this approach is that it may miss some TFs with support from only one or two types of evidence. In Fig 5, we see diffTF identifies a lot more TFs than TF-Prioritizer. I don't think we can conclude that diffTF is less specific than TF-Prioritizer simply based on the number of TFs prioritized. Some of the TFs identified only by diffTF may be important but missed by TF-Prioritizer? I would like to see more detailed analysis comparing the lists of TFs identified by diffTF and TF-Prioritizer. Other evidence or metrics in addition to the number of prioritized TFs would be helpful to evaluate the plausibility of the prioritized lists of TFs.

      2. It is hard to interpret and evaluate the contribution of the evidence for prioritized TFs. Figure 6b is helpful, but it is unclear how the users would be able to evaluate the contribution of the components. Does the software run each combination separately and output a list of prioritized TFs under each combination?

      3. The TEPIC2 paper has already developed a very comprehensive pipeline, including TF affinity calculation by TRAP and computation of TF gene scores by TEPIC, as well as logistic regression to identify TFs between conditions by DYNAMITE, and it is already well parallelized. The authors should clearly list the novel contributions from this work. It would be helpful to have a table comparing the functionalities and technical features between TF-Prioritizer and TEPIC2.

      4. The software takes histone modification ChIP-seq and RNA-seq data as input. It would significantly broaden the use of the software if it supported DNase-seq and/or ATAC-seq, which are widely used. If this software could take ATAC-seq or DNase-seq data as input, it is important to include those data types and provide some examples to illustrate the usage and performance.

      5. The software combines multiple histone modification ChIP-seq datasets using a discounted cumulative gain ranking approach. However, different types of histone modifications have different epigenomic functions and different combinations indicate different chromatin states. Some TFs may be only enriched in a small subset of histone modifications (already discussed by the authors) and may be missed by the simple discounted cumulative gain ranking approach. The authors should provide prioritized TFs from each histone modification ChIP-seq dataset, and evaluate which TFs were prioritized by all the combined datasets, and which TFs by only one dataset. Also, some ChIP-seq datasets may be of poor quality. Does the software provide other options to rank the TFs from different epigenomic datasets? e.g. set different weights for different epigenomic datasets, etc.

      6. The authors conducted co-occurrence analysis based on the overlapping of peaks. It is unclear if the method would calculate some statistical measure (e.g. p-value) for the significance of co-occurrence. Also, since the TRAP model generates a quantitative measure of TF binding affinity, I am curious to see if the quantitative TF binding affinities are also correlated for those co-occurring binding sites.
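
      The concern in major comment 1 about the multiplicative combination can be illustrated with a minimal sketch. The formula below is a simplification assumed for illustration (a plain product of the three components named in the comment) and is not the pipeline's actual implementation; the TF names and values are invented.

```python
def tf_tg_score(log2fc: float, tf_gene_score: float, coefficient: float) -> float:
    """Simplified multiplicative score: |log2FC| x TF-gene score x coefficient."""
    return abs(log2fc) * tf_gene_score * coefficient


# Hypothetical evidence per TF: (log2FC, TF-gene score, regression coefficient).
evidence = {
    "TF_A": (2.0, 0.8, 0.5),   # moderate support from all three sources
    "TF_B": (-3.0, 0.9, 0.0),  # strong support from two sources, none from the third
}

scores = {tf: tf_tg_score(*values) for tf, values in evidence.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    for tf in ranking:
        print(tf, scores[tf])
```

      Because any factor of zero forces the product to zero, TF_B drops to the bottom of the ranking despite strong support from two of the three evidence types, which is the behaviour the comment asks the authors to examine.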

      Minor comments:

      1. In Figure 1, it would be helpful to highlight which steps were already implemented in existing tools (and label the tools used), and which steps are novel in this study.

      2. H3K4me3 data seems to be missing in the L10 time point. How does the method handle missing data?

      3. It is unclear how the Pol2 ChIP-seq data was used in this study? Was it included in the model or only in the downstream analysis?

      4. It is hard to interpret the browser tracks of the TF predictions ("Predicted xxx") in Figure 3 and 4. Please add more details about those tracks.

      5. Figure 6, the authors should provide more details to help understand this figure, especially panel b. The figure legend is too short.

    2. Background Eukaryotic gene expression is controlled by cis-regulatory elements (CREs), including promoters and enhancers, which are bound by transcription factors (TFs). Differential expression of TFs and their binding affinity at putative CREs determine tissue- and developmental-specific transcriptional activity. Consolidating genomic data sets can offer further insights into the accessibility of CREs, TF activity, and, thus, gene regulation. However, the integration and analysis of multi-modal data sets are hampered by considerable technical challenges. While methods for highlighting differential TF activity from combined chromatin state data (e.g., ChIP-seq, ATAC-seq, or DNase-seq) and RNA-seq data exist, they do not offer convenient usability, have limited support for large-scale data processing, and provide only minimal functionality for visually interpreting results. Results We developed TF-Prioritizer, an automated pipeline that prioritizes condition-specific TFs from multi-modal data and generates an interactive web report. We demonstrated its potential by identifying known TFs along with their target genes, as well as previously unreported TFs active in lactating mouse mammary glands. Additionally, we studied a variety of ENCODE data sets for cell lines K562 and MCF-7, including twelve histone modification ChIP-seq as well as ATAC-seq and DNase-seq datasets, where we observe and discuss assay-specific differences. Conclusion TF-Prioritizer accepts ATAC-seq, DNase-seq, or ChIP-seq and RNA-seq data as input and identifies TFs with differential activity, thus offering an understanding of genome-wide gene regulation, potential pathogenesis, and therapeutic targets in biomedical research.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad026), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer: Roza Berhanu Lemma

      In this manuscript, Hoffmann and Trummer et al. reported a new automated pipeline that utilizes existing methods, namely (1) DESeq2 to perform differential gene expression between sample groups, (2) TEPIC, a method that links CREs to genes using a biophysical model TRAP and (3) DYNAMITE, which provides an aggregate score for TF-target genes that determine the contribution of TFs to condition specific changes between sample groups. Finally, the pipeline utilizes Mann-Whitney U test to prioritize TFs among a background distribution and a ChIP-seq specific TF distribution, which allows the identification of TFs with roles in condition-specific gene regulation. Their pipeline allows large-scale processing of data and returns a feature-rich and user-friendly interactive report.
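
      As background for the Mann-Whitney U test mentioned above, the following sketch computes the U statistic for a TF-specific score sample against a background sample. It is a plain pairwise-count formulation with invented score values, not the pipeline's code, and it omits the p-value; in practice a statistics library routine such as `scipy.stats.mannwhitneyu` would typically be used.

```python
def mann_whitney_u(sample_x, sample_y):
    """U statistic for sample_x: number of (x, y) pairs with x > y; ties count 0.5."""
    u = 0.0
    for x in sample_x:
        for y in sample_y:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


if __name__ == "__main__":
    # Hypothetical scores: a background distribution and one TF's distribution.
    background = [0.1, 0.2, 0.3, 0.4]
    tf_scores = [0.5, 0.6, 0.35]

    u_tf = mann_whitney_u(tf_scores, background)
    u_bg = mann_whitney_u(background, tf_scores)
    # The two statistics always sum to len(sample_x) * len(sample_y).
    print(u_tf, u_bg, len(tf_scores) * len(background))
```

      A U value close to the maximum (the product of the two sample sizes) indicates that the TF's scores tend to exceed the background scores.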

      The authors demonstrated how to use TF-Prioritizer using public datasets from a mouse mammary gland development study and performed independent validation using datasets from ChIP-Atlas. They were able to capture both known TFs with previously reported roles in mammary gland development/lactation and new TFs that may have a role in these processes. The work is very well thought out and executed, but to raise the quality of the work even further, the authors should address the following points.

      Major:

      1. Although their validation nicely portrays the potential application of their pipeline in answering biological questions, my concern is that this could be an isolated case. Therefore, the authors should test their pipeline using another example dataset to convince their readers. One suggestion is to run TF-Prioritizer on one of the deeply profiled cell lines (e.g. K562, MCF-7, etc.) to investigate TF prioritization, e.g. during differentiation (change of cell fate), and see if lineage-determining TFs are prioritized in such cases. This may highlight the versatility and robustness of TF-Prioritizer. This is also important as your readers are not (certainly not all of them) from the mammary gland development field. As such, dedicating a large portion of your discussion to this process is too much. Highlighting the versatility of your pipeline by capturing more than one specific developmental process will do the paper a great favor by showing the different ways TF-Prioritizer can be used, which in turn may attract more users to your pipeline.

      2. I have an issue with how the 'Results and Discussion' section is organized. The authors dedicated separate subtopics to each TF they prioritized and reviewed the literature on its role in mammary gland development and lactation. My recommendation is to instead have one subtopic and discuss these TFs paragraph by paragraph in a concise manner. A more concrete way to reorganize this would be to separate these into two subtopics: (1) Known TFs with role in mammary gland development/lactation (2) Novel TFs with predicted role in mammary gland development/lactation. To make this reorganization easier/smoother, cut down details of what you observe in the figures (e.g. p16, line 22-27 and p17, line 1-3), discuss the main message and put the detailed text about the figures in the Figure captions.

      3. All figures and tables should have more information in the caption, including those in 'Supplementary Material'.

      Minor:

      1. p7 line 9, how often does one find these combinations of data types (modalities) in different conditions, cell types or models being studied? Could some of the HMs be replaced with other data modalities, e.g. ATAC-seq, DHS data or data from other chromosome profiling methods? Could the pipeline be adapted to incorporate Cut and tag/cut and run, or is it specific to only ChIP-seq data? Authors should try to discuss whether this is possible or not.

      2. P13 line 3, the authors discuss that "ChIP-Atlas provides more than 362,121 datasets for six model organisms…". Could TF-Prioritizer be easily adapted to other databases/resources, which ChIP-Atlas does not cover (e.g. for other organisms), that the community might be interested in?

      3. p14 line 2 "... expressed gene for this analysis but focus on affinities only". Why this is the case is not argued/discussed. It would be nice if this and other choices of parameters were discussed under a separate subtopic to easily inform future readers/users of TF-Prioritizer.

      4. Figures should be cited in chronological order. Adjust the text or reorder the figures.

      5. When the authors discuss the evaluation of the prioritized TFs in separate sections, they often start with "In Figure Xa) …" and "Figure Yc) shows that …", etc. Such texts fit best as Figure captions instead of in the 'Results and Discussion'.

      6. p21 line 16, "We predicted that several Rho GTPase-associated genes are regulated by the predicted TFs". This sentence sounds a bit circular; you may rephrase as follows: 'We propose that our predicted TFs regulate several Rho GTPase-associated genes'.

      7. Figure 3 and 4 have the same general message/purpose and look redundant. This is reflected in the phrase '...(black arrows) as they are already known to be crucial in either mammary gland development or lactation.' and 'In the heatmaps, we can observe a clear separation of these target genes between the time points X and Y…'. I suggest the authors choose one of them as a main figure and place the other in Supplementary Material.

      8. In the captions of Fig. 3 and 4, the authors should indicate what the black boxes represent. One can guess what they are from your main text, but the captions could profit from a bit more detailed explanation. You should at least describe some of the things that need to be highlighted from the figures to easily guide your readers.
    3. Background Eukaryotic gene expression is controlled by cis-regulatory elements (CREs), including promoters and enhancers, which are bound by transcription factors (TFs). Differential expression of TFs and their binding affinity at putative CREs determine tissue- and developmental-specific transcriptional activity. Consolidating genomic data sets can offer further insights into the accessibility of CREs, TF activity, and, thus, gene regulation. However, the integration and analysis of multi-modal data sets are hampered by considerable technical challenges. While methods for highlighting differential TF activity from combined chromatin state data (e.g., ChIP-seq, ATAC-seq, or DNase-seq) and RNA-seq data exist, they do not offer convenient usability, have limited support for large-scale data processing, and provide only minimal functionality for visually interpreting results. Results We developed TF-Prioritizer, an automated pipeline that prioritizes condition-specific TFs from multi-modal data and generates an interactive web report. We demonstrated its potential by identifying known TFs along with their target genes, as well as previously unreported TFs active in lactating mouse mammary glands. Additionally, we studied a variety of ENCODE data sets for cell lines K562 and MCF-7, including twelve histone modification ChIP-seq as well as ATAC-seq and DNase-seq datasets, where we observe and discuss assay-specific differences. Conclusion TF-Prioritizer accepts ATAC-seq, DNase-seq, or ChIP-seq and RNA-seq data as input and identifies TFs with differential activity, thus offering an understanding of genome-wide gene regulation, potential pathogenesis, and therapeutic targets in biomedical research.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad026), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer Xiaowo Wang

      Markus et al. developed a new pipeline, TF-Prioritizer, to discover potential cell- or tissue-specific transcription factors (TFs) from ChIP-seq data of histone modifications and RNA-seq data. TF-Prioritizer is mainly based on the framework of the state-of-the-art method TEPIC to model TFs regulating genes. The authors extend TEPIC by integrating more information, such as differential gene expression using DESeq2, and by linking TF binding in cis-regulatory elements to gene expression using DYNAMITE. They also designed a new statistical method to rank the TFs across different cell types or in time-series data. The authors also provide some cases to validate the pipeline. The pipeline is useful in biomedical research. The manuscript is well-written and provides enough details. Addressing or further considering the following issues may benefit readers.

      1. TF-Prioritizer requires ChIP-seq of histone modification (HM) as the input. It may support different types of HM. Users may want to know how to choose a proper set of HMs. Authors should evaluate some cases to show TF-Prioritizer's performance when inputting different HMs.

      2. ATAC-seq is more widespread for different kinds of cells or tissues. It seems TF-Prioritizer can also apply to ATAC-seq peaks. Why does TF-Prioritizer not support ATAC-seq now?

      3. On page 11, there may be some mistakes in the definition of BG(m) and FG(t,m). t \in TF(m) of BG(m) should be moved to FG(t,m)?

      4. The software is hard to install without a sudo/root account. It would be better to provide a docker image that is ready for the users to run the software.

  15. May 2023
    1. Abstract

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.81), and has published the reviews under the same license. These are as follows.

      **Reviewer 1. Jin Sun**

      Gomes-dos-Santos et al. have upgraded the freshwater mussel Margaritifera margaritifera genome using long-read sequencing. Overall, this version is dramatically improved compared to the former one, with an increased N50 value and BUSCO score and a decreased number of contigs. Considering the important economic value of M. margaritifera and the high quality of the assembly, I must congratulate the authors on this. However, in contrast to the high-quality assembly, I am a bit wary of the genome annotation part. To me, the number of gene models predicted is a bit higher than in other molluscan genomes. This is also reflected by the low proportion of gene models that can be annotated by Swiss-Prot, GO, etc. I suspect that the high number of gene models could be a consequence of applying only ab initio evidence in the current study. More sophisticated approaches, such as EVM or MAKER, should be used to see whether the number of gene models can be reduced without sacrificing the BUSCO scores of the gene models.

      Line 76, The official name shall be “Oxford Nanopore Technology (ONT)”.

      Fig. 1: it is interesting to see the wide distribution of M. margaritifera. I would be interested to know whether there is any genetic differentiation between the European population and the North American population.

      **Reviewer 2. Rebekah L. Rogers**

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Y. All methods seem standard and high quality for a genome release.

      If the authors could add a table comparing with other Unio genomes, that might be helpful: gene numbers, BUSCO scores, N50s, and other relevant stats. It will help readers see the value of this more contiguous genome. Suggested genomes for comparison:

      - V. ellipsiformis (Renaut et al.)
      - M. nervosa
      - P. streckersoni

    1. ABSTRACT

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.80), and has published the reviews under the same license. These are as follows.

      **Reviewer 1. Grace Mugumbate**

      Please add additional comments on language quality to clarify if needed

      Yes. First-person reporting has been used, with the word "We" used extensively.

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      No. There is a need to specify the type, size, standardisation and curation of the data that were used, especially where additional data were obtained from different databases.

      Is the data acquisition clear, complete and methodologically sound?

      Yes. Sources of data are indicated in the paper; however, the size of the data sets and the type of data are not clear.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. There is a need to give more detail in the methods for reproducibility.

      Is there sufficient data validation and statistical analyses of data quality?

      No. Validation was performed; however, no statistical analysis was mentioned.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      No. More detail is needed on data retrieval to allow reuse of the dataset.

      Additional Comments:

      The authors presented their work entitled 'Mycobacterial Metabolic Model Development for Drug Target Identification'. This is very innovative work that led to the generation of M. leprae and M. abscessus models, which are important tools for drug target identification. Target identification for a number of infectious diseases provides information for structure-based molecular modification of new and alternative diseases. The target-specific compounds will help reduce side effects, among other things. Generation of the models by the authors is commendable.

      There are a few corrections: 1) Under Abstract, Line 4: Please note that Mycobacterium tuberculosis is not a disease but the bacterium that causes the disease tuberculosis. 2) Methods, GEM reconstruction, curation and simulation: (i) Line 2: Name the "other organism specific databases". (ii) Give a brief description of COBRApy and GLPK even if the source has been given. 3) The Methods section needs to be more informative to allow for reproducibility.

      **Reviewer 2. Nagasuma Chandra**

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes. It would be useful if the authors could comment on how the models vary between the two species and with respect to M. tuberculosis. Specifically, a note on how the authors deal with alternate enzymes and whether they included enzymes specific to each species, would be helpful.

      Is the validation suitable for this type of data?

      Yes. A figure depicting the overall capability of the models would be useful

      Additional Comments:

      Genome-scale metabolic models are useful to the community as they can be used to address a variety of questions. It would be useful if the authors could include a section on the comparative performance of the models and link it to the known metabolic capability of these microbes.

    1. Abstract

      This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giad028), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Philippe Boileau

      This manuscript introduces the new Docker-based JupyterLab framework in Galaxy, describing its core components and demonstrating its use in the reproduction of two analyses. The proposed framework is also thoroughly compared to competitors, like Google’s Colab and Amazon’s SageMaker. This tool is bound to have an impact on the life sciences: it democratizes computational analyses and facilitates reproducibility. I thank the authors for their important work. However, I think that this technical note should be reviewed for grammatical errors and faulty punctuation. I’ve identified some such issues in the comments below but wasn’t able to address all of them. Included in the comments are other remarks which, if addressed, could strengthen some key takeaways.

      • The first sentence of the abstract states that AI programs require “powerful compute infrastructure” when applied to large datasets. I think readers would like to know how you qualify an infrastructure as “powerful”. A brief definition could be included in the second sentence instead of repeating “. . . hosted on a powerful infrastructure . . . ”.
      • Is it “JupyterLab” or “jupyterlab notebook”? The Project Jupyter site seems to use the former. Based on the documentation, JupyterLab is a web-based user interface that can open Jupyter notebooks (.ipynb files).
      • The statement “Artificial intelligence (AI) approaches such as machine learning (ML) and deep learning (DL) . . . ” implies that ML and DL are distinct aspects of AI. This distinction is insinuated throughout the rest of the text. Isn’t DL a subset of ML? I suggest replacing “ML and DL algorithms” by “ML algorithms” and specifying “DL algorithms” only as needed.
      • I believe there’s a missing comma between “ecosystems” and “enabling” in the first sentence of the Docker container section.
      • Consider reformatting “A container runs . . . of the running software.” to “A container runs an isolated environment with minimal interactions between it and the host OS. Running software in a container is more secure.”
      • Related to the suggestion above: Can you explain why this increased security is necessary? An example might help emphasize the importance of a secure container.
      • I think “Docker container inherits . . . ” should be “The Docker container inherits . . . ”. Same goes for “Docker container is decoupled . . . ”.
      • Consider reformatting “Moreover, it can easily be extended by installing suitable packages only by adding their appropriate package names in its dockerfile.” to “Moreover, the Docker container is easily extended: additional software packages can be installed by adding their names to the dockerfile.”
      • Consider replacing “some of the popular ones are” by “including”.
      • I believe there’s an unneeded comma between “. . . platform for both” and “rapid prototyping. . . ”.
      • I believe that there’s a missing word in the last sentence of the Features of jupyterlab and notebook infrastructure section: “. . . an H5 file.”
      • “google” and “amazon” should be capitalized.
      • Consider removing “and non-ideal” from the Related infrastructure section.
      • I believe the comma in “. . . but they come at a price, . . . ” should be replaced by a colon.
      • I believe there’s a missing comma between “. . . free of charge” and “similar to colab . . . ”.
      • Why is sharing a session’s resources across multiple notebooks more useful than operating each notebook in a separate session? Isn’t the latter preferable when a notebook causes a session to crash?
      • “deep learning” in the Implementation section should be replaced by “DL” for consistency.
      • I think that readers would find a link to your tool on Galaxy Europe useful: https://usegalaxy.eu/root?tool_id=interactive_tool_ml_jupyter_notebook. The same is true for your tutorial: I think readers would find a URL in the text more easily than in the references. However, the tool failed to execute on usegalaxy.eu with the following error message: “This tool is restricted to authorized users”. I was unable to follow the tutorial. Was this a one-off issue with the Galaxy servers?

    2. Abstract

      Reviewer 2: Milot Mirdita

      Kumar et al. present a Docker-based integration of Jupyter Notebooks in the Galaxy workflow system that can utilize GPUs. This notebook is also available in the Galaxy Europe instance.

      I was able to create a Galaxy Europe account, find the newly introduced Galaxy tool and submit a job. However, it remained stuck with the message "This job is waiting to run" and the job info "Stopped" for multiple hours. I was able to download the Docker image and run it on a local server with multiple Nvidia GPUs. This resulted in a running Jupyter Lab; however, running the GPU-based examples resulted in driver mismatch errors/warnings (pynvml.nvml.NVMLError_LibRmVersionMismatch: RM has detected an NVML/RM version mismatch; kernel version 470.141.3 does not match DSO version 515.65.1 -- cannot find working devices in this configuration). Thus, the examples ran on CPU only. I did not try to resolve this issue and only repeated some examples.

      The authors show two use-cases for the GPU Jupyter Docker and provide a step-by-step tutorial for usage on Galaxy Europe. Shipping machine learning applications that utilize GPUs as Jupyter Notebooks has become popular recently and supporting these through well-known and freely accessible Galaxy servers, such as Galaxy Europe, would be of clear benefit to users. Additionally, it would be very valuable for method developers like me to easily deploy GPU-based methods to Galaxy servers.

      Major: - As mentioned before, I had issues getting a running Jupyter Lab on the Galaxy Europe server. Is this due to a limited number of GPUs or was this due to an error? - Our ColabFold Multiple Sequence Alignment server currently processes about 10-20k MSAs per day. We do not know how many of these are running on Google Colab or on users' local machines. However, a substantial number of predictions are running inside Google Colab. The authors claim that Google Colab's and Kaggle's resources are scarce. However, generally, users (with either free or pro accounts) are given an instance nearly immediately on Colab. I recognize that it is extremely difficult to compete with these commercial platform providers. However, providing a long-term, freely available and securely funded, platform with ML accelerators would be extremely beneficial for the whole community. I would like to see a discussion on what GPU resources are currently available to users of Galaxy Europe (and the whole Galaxy Project) and what plans exist to expand these in the future. - The size of the docker container (compressed ~10GB, uncompressed ~22GB) seems difficult to sustain. Both keeping up an up-to-date Docker image and ensuring the availability of older images for reproducibility looks difficult to me, especially with such fast moving dependencies such as machine learning frameworks. How do the authors plan to deal with this issue?

      Minor: - Please highlight the tutorial (https://training.galaxyproject.org/training-material/topics/statistics/tutorials/gpu_jupyter_lab/tutorial.html) on GitHub and inside the container readme (home_page.ipynb). It is very easy to overlook. I also nearly overlooked the example notebook repository (https://github.com/anuprulez/gpu_jupyterlab_ct_image_segmentation). I found it confusing, that I could not find the two shown example use-cases inside the Docker container. I only later figured out that I have to clone the example repository into the running container. - The manuscript highlights various workflow methods (elyra, kubeflow, airflow), however it needs clarification on how the Galaxy workflow integration works. I saw that it is possible to give input of another Galaxy output to the tool. I would appreciate a tutorial on how to make the GPU Jupyter Docker into part of a Galaxy workflow with multiple tools running. I think the above mentioned tutorials can be expanded to show how the output can be given to the next tool. - Docker Hub has introduced many business-model changes such as deleting container images that are rarely used, which poses a challenge for reproducibility. I know that Dr Grüning is involved in the Biocontainers project. I would recommend investigating if it is possible to combine these efforts to make this GPU container and derived containers long term available. - The Docker container is explicitly running as a root user, while the manuscript highlights the security benefits of Docker. The cited report by Baset et al. highlights the security benefits and the many security challenges that Docker containers pose. I suggest checking what security best practices for Docker containers are possible to implement, while still allowing GPUs to be exposed to users. - I recommend revising the manuscript for conciseness, with an additional focus on capitalization of words.

    1. Background

      This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giad025), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Weilong Guo, PhD

      Patrick König and colleagues have built a web application for the interactive query, visualization and analysis of genomic diversity data, supporting population structure analysis on specific genetic elements and data export. The application can also easily be used as a plugin for existing web applications. According to its documentation, this application can be easily installed from pip, Docker and conda, which would be useful for population genomic studies. There are still several concerns about this manuscript.

      Major concerns:

      1. As for the SNP visualization function, only a very limited number of SNPs can be displayed on the webpage, and there are no functions such as "zoom in" or "zoom out" (it is suggested to add these or similar functions). Although the application can export almost all the SNP sites of a whole VCF file, it is far from user-friendly. It is suggested to add a chromosome track showing the genomic window being queried and allowing the cursor to select or adjust the genomic region (UCSC browser style), which is necessary for an intuitive user experience.

      2. The BLAST function could serve as a useful entry point. But what is the starting position of the query sequence when it is mapped on the minus strand? The authors should explain this more clearly on the website.

      3. The authors mention that their application converts the input VCF file into Zarr format. Thus, more performance evaluation should be reported to show the advantages of this strategy (rather than using the VCF file directly).

      4. The authors should also compare their application with other similar existing web applications, such as CanvasDB, Gigwa, SNiPlay and SnpHub, to highlight its advantages and improvements.
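
      On major concern 2, the kind of convention the website could document can be sketched as follows. The sketch assumes BLAST tabular output from a nucleotide search, where a minus-strand hit is reported with the subject start greater than the subject end. Whether DivBrowse follows this convention is not stated in the manuscript, and the names `GenomicInterval` and `normalize_blast_hit` are hypothetical.

```python
from typing import NamedTuple


class GenomicInterval(NamedTuple):
    start: int   # 1-based, inclusive, always <= end
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def normalize_blast_hit(sstart: int, send: int) -> GenomicInterval:
    """Convert BLAST subject coordinates into a strand-aware interval.

    Assumption: for nucleotide searches, a hit on the minus strand is
    reported with sstart > send, so the left-most genomic position is
    the smaller of the two values.
    """
    if sstart <= send:
        return GenomicInterval(sstart, send, "+")
    return GenomicInterval(send, sstart, "-")


# Illustrative coordinates only, not taken from the manuscript.
plus_hit = normalize_blast_hit(1200, 1450)
minus_hit = normalize_blast_hit(1450, 1200)
print(plus_hit)
print(minus_hit)
```

      Under this convention both hits open the same genomic window, and only the reported strand differs, which is the behaviour a user would need spelled out.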

      Minor concerns:

      1. The analysis functions are still insufficient. It is suggested to also support commonly used analysis tools or methods, such as haplotype analysis, STRUCTURE analysis, distribution of nucleotide diversity and selective sweep analysis.

      2. Ref. 22 is incomplete.

    2. Background

      Reviewer 2: Armin Scheben

      The authors present the web app DivBrowse for visualizing genomic variant data. Their code is publicly available, and their web app is well-documented and provides several demonstration implementations for human, mouse and barley. The manuscript is well-written and concisely covers the key features of DivBrowse and summarizes the implementation of the software.

      I was able to test the demonstration website and was impressed with how smoothly everything ran and was set up. Due to time constraints, I was not able to test the installation and setup of DivBrowse, but the documentation looks sufficient to allow easy setup by experts. Overall, I think this is a useful contribution to the community. One key issue I believe the authors should address, however, is that the manuscript presents DivBrowse in a vacuum, with little mention of or comparison with existing software with overlapping functionality. Below I provide some further details to illustrate my point and how it might be addressed, and list several other minor comments.

      Main comment

      The authors rightly indicate in their introduction that the growing amounts of genomic data generated require robust solutions for visualization and exploration that do not require use of the command line. But the authors fail to mention that there exists a considerable ecosystem of software that already does this. Moreover, some of the software available offers substantially expanded features compared to DivBrowse.

      To help readers better decide when DivBrowse might be the right choice for their needs compared to other options, the authors could cite existing software and provide some comparison. My knowledge of all available software is not exhaustive, but Wang et al. 2020 (https://doi.org/10.1093/gigascience/giaa060), in their publication of SnpHub, provide a comparison table including SnpHub itself and JBrowse. I would consider both of these tools for exploration and visualization of SNPs and additional data, similar to DivBrowse. JBrowse is relatively widely used and considerably more feature-rich. The standalone offline tool TASSEL (https://academic.oup.com/bioinformatics/article/23/19/2633/185151) also offers many options for visualisation, exploration and analysis of VCF data. There may also be other tools I am not aware of, and readers would likely benefit from a brief overview of the landscape, the pros and cons of each piece of software, and what differentiates DivBrowse.

      Minor comments

      The authors can consider the minor comments below as 'take it or leave it' comments. I do not think it is essential to address these, but in my view they may enhance the manuscript.

      1) In the discussion, the authors point out the efficiency and low latency of DivBrowse, but this is not quantified in the manuscript. If it were technically feasible without substantial effort, it might be useful to quantify in some way just how efficient DivBrowse can be, especially if this could be one of the stand-out features of DivBrowse.

      2) The authors use divergence Bezier curves to increase the amount of variant calls that can be visualized. This is helpful and a useful default. However, invariant sites can also be of considerable evolutionary and breeding/medicinal interest. When collapsing invariant sites, they become indistinguishable from unmapped regions. This is a fundamental issue and many VCF files may not encode information on invariant sites, so it may not be possible to develop robust functionality that allows users to also show invariant sites optionally. Still, this point may be worth briefly mentioning in the discussion, if the authors agree it is noteworthy.

      3) One advantage of visualizing relatively raw data like SNPs is that it can reveal patterns that are less obvious in other types of data exploration. To take full advantage of this, tools like JBrowse allow export of the browser window in SVG format, so that users can incorporate images into high-resolution figures. I don't expect the authors to necessarily implement this feature for this review, but it may be worth adding it to the list of potential enhancements that could be implemented based on user demand.
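
      The ambiguity raised in minor comment 2 can be made concrete with a small sketch. It assumes that a mask of callable regions is available alongside the variant-only VCF (for example, derived from sequencing coverage). Such a mask is an assumption of this illustration, not a documented DivBrowse input, and `classify_site` is a hypothetical helper.

```python
from bisect import bisect_right
from typing import List, Set, Tuple


def classify_site(
    pos: int,
    variant_positions: Set[int],
    callable_intervals: List[Tuple[int, int]],
) -> str:
    """Classify a position as 'variant', 'invariant' or 'no data'.

    callable_intervals must be sorted, non-overlapping, 1-based and
    inclusive. Without such a mask, every non-variant position would
    look the same, whether it is truly invariant or simply unmapped.
    """
    if pos in variant_positions:
        return "variant"
    starts = [start for start, _ in callable_intervals]
    idx = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    if idx >= 0 and callable_intervals[idx][0] <= pos <= callable_intervals[idx][1]:
        return "invariant"
    return "no data"


# Toy data for illustration only.
variants = {105, 230}
callable_mask = [(100, 200), (220, 300)]
for position in (105, 150, 210):
    print(position, classify_site(position, variants, callable_mask))
```

      Position 150 is reported as invariant only because the mask says it was callable; position 210 lies in the gap between intervals and cannot be distinguished from an unmapped region using the VCF alone.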

    1. Motivation

      Reviewer 2: Mulin Jun Li

      In this manuscript, the authors updated their previous ReMM score to the GRCh38 human genome build and provide a convenient and fast data source. The authors then use some examples to demonstrate the usability of the resource. It is original to point out the differences in prioritization tools between genome builds. However, we have the following concerns and comments:

      Major:

      1. How were variants with missing values in the test datasets handled when comparing the new ReMM with other tools? The authors mention that ExPecto annotated only half of the million negative variants.

      2. Although CADD used the same negative training dataset, it is not suitable to compare the tools on the ReMM training dataset. How do these tools perform on independent test datasets?

      3. The authors presume that the new genome build will give better performance. Is there evidence to support this, such as the distribution of features or training data in the different genome builds?

      4. Other existing tools can also prioritize disease-causal noncoding variants, such as regBase-PAT, NCBoost and ncER. Can the authors compare the new version of ReMM with these tools?
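
      For the first major point, one common way to handle missing scores is to evaluate every tool only on the variants that all tools could score. The sketch below illustrates that option with a rank-based AUROC. It is not the authors' evaluation procedure; the function names, tool names and scores are hypothetical toy values.

```python
from typing import Dict, List, Optional


def auroc(scores: List[float], labels: List[int]) -> float:
    """Rank-based AUROC: the probability that a positive outscores a negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative variant")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def compare_on_shared_variants(
    tool_scores: Dict[str, Dict[str, Optional[float]]],
    labels: Dict[str, int],
) -> Dict[str, float]:
    """Compute AUROC per tool, restricted to variants scored by every tool."""
    shared = [
        variant
        for variant in labels
        if all(scores.get(variant) is not None for scores in tool_scores.values())
    ]
    return {
        tool: auroc([scores[v] for v in shared], [labels[v] for v in shared])
        for tool, scores in tool_scores.items()
    }


# Toy example: tool_b has no score for variant v4, so v4 is dropped for both tools.
labels = {"v1": 1, "v2": 1, "v3": 0, "v4": 0, "v5": 0}
tool_scores = {
    "tool_a": {"v1": 0.9, "v2": 0.8, "v3": 0.2, "v4": 0.4, "v5": 0.1},
    "tool_b": {"v1": 0.7, "v2": 0.3, "v3": 0.5, "v4": None, "v5": 0.2},
}
result = compare_on_shared_variants(tool_scores, labels)
print(result)
```

      Restricting to the shared subset keeps the comparison fair across tools, at the cost of discarding variants. Reporting the size of the shared subset alongside the metric would show how much of the test set each comparison actually covers.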