876 Matching Annotations

Apr 2025
www.biorxiv.org www.biorxiv.org

Healthy microbiome - moving towards functional interpretation

2
1. GigaScience 03 Apr 2025
  
  in GigaScience
  
  AbstractMicrobiome-based disease prediction has significant potential as an early, non-invasive marker of multiple health conditions linked to dysbiosis of the human gut microbiota, thanks in part to decreasing sequencing and analysis costs. Microbiome health indices and other computational tools currently proposed in the field often are based on a microbiome’s species richness and are completely reliant on taxonomic classification. A resurgent interest in a metabolism-centric, ecological approach has led to an increased understanding of microbiome metabolic and phenotypic complexity revealing substantial restrictions of taxonomy-reliant approaches. In this study, we introduce a new metagenomic health index developed as an answer to recent developments in microbiome definitions, in an effort to distinguish between healthy and unhealthy microbiomes, here in focus, inflammatory bowel disease (IBD). The novelty of our approach is a shift from a traditional Linnean phylogenetic classification towards a more holistic consideration of the metabolic functional potential underlining ecological interactions between species. Based on well-explored data cohorts, we compare our method and its performance with the most comprehensive indices to date, the taxonomy-based Gut Microbiome Health Index (GMHI), and the high dimensional principal component analysis (hiPCA)methods, as well as to the standard taxon-, and function-based Shannon entropy scoring. After demonstrating better performance on the initially targeted IBD cohorts, in comparison with other methods, we retrain our index on an additional 27 datasets obtained from different clinical conditions and validate our index’s ability to distinguish between healthy and disease states using a variety of complementary benchmarking approaches. Finally, we demonstrate its superiority over the GMHI and the hiPCA on a longitudinal COVID-19 cohort and highlight the distinct robustness of our method to sequencing depth. Overall, we emphasize the potential of this metagenomic approach and advocate a shift towards functional approaches in order to better understand and assess microbiome health as well as provide directions for future index enhancements. Our method, q2-predict-dysbiosis (Q2PD), is freely available (https://github.com/Kizielins/q2-predict-dysbiosis).
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf015), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Saritha Kodikara
  
  In this study, the authors present a novel metagenomic health index designed to differentiate between healthy and unhealthy microbiomes. This area of research is crucial for developing a non-invasive, cost-effective method to assess patient health status. However, I have several suggestions that I believe will enhance the study and address some key points.
  
  Main Comments:
  
  1.) The study would benefit from additional post-analysis to provide greater depth. Although the authors applied their approach to several diseases, they did not elaborate on the significance of individual microbiome features across different diseases. For instance, the GMHI parameters were identified as least important in IBD—does this observation hold universally across all diseases analysed?
  
  2.) The index Q2D performed worse in AGP1 compared to HMP2 and AGP2. Is there a specific reason for this discrepancy? For example, does the index underperform in the heterogeneous functional landscape presented in AGP1 (Figure 2C)? An explanation for the reduced performance in this cohort would provide valuable insights into the method's performance under varying conditions.
  
  3.) It would be beneficial to make all processed data and relevant scripts available in a GitHub repository to ensure that the results presented in the paper can be replicated by other researchers.
  
  4.) When attempting to run the script available at https://github.com/Kizielins/q2-predict-dysbiosis, I encountered an error related to the scikit-learn version. The script appears to be compatible with version 1.2.2, whereas I was using version 1.4.2. Please consider updating the script or providing instructions for resolving version compatibility issues.
  
  5.) The rationale behind considering only positive correlations when calculating the index is unclear. It would be helpful to clarify why negative correlations were excluded from the index calculations.
  
  6.) In analysing longitudinal alterations, did the authors account for dependencies from previous time points Q2D index? If not, how do these longitudinal alterations differ from those observed in independent studies?
  
  7.) For each dataset analysed, additional details would be useful, such as the number of samples, species, functions, core functions, and the number of species remaining after applying the MDFS algorithm.
  
  8.) On Page 13, the authors state that they chose GMHI as their benchmark because hiPCA and Shannon entropy produced worse results for the HMP2 cohort. However, Supplementary Table 3 indicates that Shannon entropy had a lower p-value than GMHI in the Mann-Whitney U test.
  
  Minor comments:
  
  1) Page 11 Original: "Collecting information on feature importance at every iteration of the cross-validation procedure model, we consistently identified the two GMHI parameters as the least important (Figure 5b)." Suggested: "Collecting information on feature importance at every iteration of the cross-validation procedure model, we consistently identified the two GMHI parameters as the least important (Figure 4b??)."
  
  2) Page 12 Original: "Most importantly, Q2PD produced visually the highest scores for all healthy in comparison to unhealthy cohorts." Suggested: "Most importantly, Q2PD produced visually the highest median?? scores for all healthy in comparison to unhealthy cohorts."
  
  3) Page 12 Original: "Q2PD was also the only index to produce a statistically significant difference between Healthy and Obese in HMP2" Suggested: "Q2PD was also the only index to produce a statistically significant difference between Healthy and Obese in AGP2??"
  
  4) Page 14 Original: "The Q2PD important in all datasets that were included in its training and validation, specifically AGP_1, AGP_2 and HMP2 (Table 1, Supplementary Figure 7)." Suggested: "The Q2PD important in all datasets that were included in its training and validation, specifically AGP_1, AGP_2 and HMP2 (Table 1, Supplementary Figure 8??)."
2. GigaScience 03 Apr 2025
  
  in GigaScience
  
  AbstractMicrobiome-based disease prediction has significant potential as an early, non-invasive marker of multiple health conditions linked to dysbiosis of the human gut microbiota, thanks in part to decreasing sequencing and analysis costs. Microbiome health indices and other computational tools currently proposed in the field often are based on a microbiome’s species richness and are completely reliant on taxonomic classification. A resurgent interest in a metabolism-centric, ecological approach has led to an increased understanding of microbiome metabolic and phenotypic complexity revealing substantial restrictions of taxonomy-reliant approaches. In this study, we introduce a new metagenomic health index developed as an answer to recent developments in microbiome definitions, in an effort to distinguish between healthy and unhealthy microbiomes, here in focus, inflammatory bowel disease (IBD). The novelty of our approach is a shift from a traditional Linnean phylogenetic classification towards a more holistic consideration of the metabolic functional potential underlining ecological interactions between species. Based on well-explored data cohorts, we compare our method and its performance with the most comprehensive indices to date, the taxonomy-based Gut Microbiome Health Index (GMHI), and the high dimensional principal component analysis (hiPCA)methods, as well as to the standard taxon-, and function-based Shannon entropy scoring. After demonstrating better performance on the initially targeted IBD cohorts, in comparison with other methods, we retrain our index on an additional 27 datasets obtained from different clinical conditions and validate our index’s ability to distinguish between healthy and disease states using a variety of complementary benchmarking approaches. Finally, we demonstrate its superiority over the GMHI and the hiPCA on a longitudinal COVID-19 cohort and highlight the distinct robustness of our method to sequencing depth. Overall, we emphasize the potential of this metagenomic approach and advocate a shift towards functional approaches in order to better understand and assess microbiome health as well as provide directions for future index enhancements. Our method, q2-predict-dysbiosis (Q2PD), is freely available (https://github.com/Kizielins/q2-predict-dysbiosis).
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf015), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Vanessa Marcelino
  
  The manuscript proposes a new method to distinguish between healthy and diseased human gut microbiomes. The topic is timely, as to date, there is no consensus on what constitutes a healthy microbiome. The key conceptual advance of this study is the integration of functional microbiome features to define health. Their new computational approach, q2-predict-dysbiosis (Q2PD), is open source and available on GitHub.
  
  While the manuscript is conceptually innovative and interesting for the scientific community, there are several major limitations in the current version of this study.
  
  To develop the Q2PD, they define features associated with health by comparing it with microbiome samples from IBD patients. There are many more non-healthy/dysbiotic phenotypes beyond IBD, therefore it is not accurate to use IBD as synonymous of dysbiosis as done throughout this version of the paper.
  
  The study initially tests the performance of Q2PD against other gut microbiome health indexes (GMHI and hiPCA) using the same data that was used to select the health-associated features of Q2PD. Model performance should be assessed on independent data. On a separate analysis, they do use different datasets (from GMHI and hiPCA), but these datasets seem to be incomplete - GMHI and hiPCA publications have included 10 or more disease categories, and it is unclear why only 4 categories are shown in this study.
  
  While Q2PD does provide visible improvements in differentiating some diseases from healthy phenotypes, the accuracy and sensitivity of Q2PD isn't clear. To adopt Q2PD, I would like to know what are the chances that the classification results will be correct.
  
  There is very little documentation on how to use Q2PD. What are the expect outputs for example, do we need to chose a threshold to define health? Is the method completely dependent on Humann and Metaphlan outputs, or other formats are accepted? The test data contain some samples with zero counts. I got an error when trying it with the test data (ValueError: node array from the pickle has an incompatible dtype…).
  
  Therefore, I recommend including a range of disease categories to develop Q2PD and use independent datasets to validate the model in terms of accuracy and sensitivity. Alternatively, consider focusing this contribution on IBD. Making the code more user friendly will drastically increase the adoption of Q2PD by the community.
  
  Please also use page and line numbers when submitting the next version. Other suggestions:
  
  Abstract: I recommend replacing 'attributed' with 'linked', as 'attributed' suggests that dysbiosis may be causing (rather than reflecting) disease.
  
  Results: Please indicate what it is meant by 'function' here - it will be good to clarify that this method uses Metaphlan's read-based approach to identify metabolic pathways. What is used, pathway completeness or abundance?
  
  Results regarding Figure 3a are difficult to interpret. Is 'non-negatively correlated' the same as 'positively correlated'? What does the colour gradient represent - their abundance in those groups, or the strength of their correlation?
  
  "We observed that the prevalence of the pairs positively correlated in health was higher than in a number of disease-associated groups (Figure 3b)" . This is a very generalised statement considering that only half of the comparisons were significant. How co-occurring species were selected?
  
  "To test this, we compared the contributions of MDFS-identified species to "core functions" in different groups (Supplementary Figure 4)." How was this comparison made, based on species correlations? The caption of these figures could include more detail - it just says 'Top species contributions to functions.' but how do you define 'top' ? What do the colours represent?
  
  'This finding was congruent with our earlier suspicions of functional plasticity; modulation of function and thus altered connectivity in the interaction network, shifting towards less abundant, non-core functions upon perturbation of homeostasis.' This is reasonable, but I don't understand how you can draw this conclusion from these figures where there seems to be no significant difference between health and disease.
  
  Section 'Testing q2-predict-dysbiosis, GMHI and hiPCA accuracy of prediction for healthy and IBD individuals'
  
  What is the difference between fraction of "core functions" found the fraction of "core functions" among all functions?
  
  "Most importantly, Q2PD produced visually the highest scores for all healthy in comparison to unhealthy cohorts" . This was not statistically significant. In fact, GMHI finds more significant differences between health and disease than Q2PD.
  
  Sup. Figure 7 - would be informative to add the name/description of these metabolites not just their ID).
  
  'Although the threshold of 0.6 as determinant of health by the Q2PD was not applicable to the new datasets'. Does the threshold to define health with Q2PD change depending on the dataset? What are the implications of this for the applicability of this index?
  
  Effects of sequencing depth - this is a very good addition to the paper, the effects of sequencing depth can be profound but are ignored in most studies, so I commend the authors for doing this here. It would be even better, in my opinion, if this was done with the same datasets used to test/compare Q2PD with other methods, as using a different dataset here adds a new layer of confounding factors.
  
  'the GMHI and the hiPCA produced the opposite trend, wrongly indicating patient recovery.' The difference here is striking, what is driving this trend?
  
  The Gut Microbiome Wellness Index 2 (GMWI2) is now published. I don't think it needs to be part of the benchmarking, but it could be acknowledged/cited here.
  
  Methods: More information on how the data was processed is needed - how were the abundance tables normalized? Which output from Humann was used for downstream analyses?
  
  To ensure reproducibility, please provide the scripts/code used for analyses and figures.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.12.04.569909v6
www.biorxiv.org www.biorxiv.org

Genomic and transcriptomic analyses of Heteropoda venatoria reveal the expansion of P450 family for starvation resistance in spider

2
1. GigaScience 03 Apr 2025
  
  in GigaScience
  
  AbstractBackground Spiders generally exhibit robust starvation resistance, with hunting spiders, represented by Heteropoda venatoria, being particularly outstanding in this regard. Given the challenges posed by climate change and habitat fragmentation, understanding how spiders adjust their physiology and behavior to adapt to the uncertainty of food resources is crucial for predicting ecosystem responses and adaptability.Results We sequenced the genome of H. venatoria and, through comparative genomic analysis, discovered significant expansions in gene families related to lipid metabolism, such as cytochrome P450 and steroid hormone biosynthesis genes. We also systematically analyzed the gene expression characteristics of H. venatoria at different starvation resistance stages and found that the fat body plays a crucial role during starvation in spiders. This study indicates that during the early stages of starvation, H. venatoria relies on glucose metabolism to meet its energy demands. In the middle stage, gene expression stabilizes, whereas in the late stage of starvation, pathways for fatty acid metabolism and protein degradation are significantly activated, and autophagy is increased, serving as a survival strategy under extreme starvation. Additionally, analysis of expanded P450 gene families revealed that H. venatoria has many duplicated CYP3 clan genes that are highly expressed in the fat body, which may help maintain a low-energy metabolic state, allowing H. venatoria to endure longer periods of starvation. We also observed that the motifs of P450 families in H. venatoria are less conserved than those in insects, which may be related to the greater polymorphism of spider genomes.Conclusions This research not only provides important genetic and transcriptomic evidence for understanding the starvation mechanisms of spiders but also offers new insights into the adaptive evolution of arthropods.
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf019), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Sandra Correa-Garhwal
  
  The manuscript "Genomic and transcriptomic analyses of Heteropoda venatoria reveal the expansion of P450 family for starvation resistance in spider" uses comparative genomics to study the underlying mechanisms of starvation resistance. I appreciate that the authors have produced a high-quality genome for an RTA species. The methods are sound and some interesting gene families are highlighted as key factors in starvation resistance.
  
  One primary concern I have relates to the study's setup and hypothesis. As currently written, the study comes across as a fishing expedition rather than a focused research project. Although the introduction is informative, it lacks a clear rationale for including this particular species. The reasoning only becomes apparent at the end of the gene family expansion and contraction section. Additionally, I am unsure if being an active hunter makes feeding more unpredictable compared to web-based prey capture. I recommend incorporating this information into the introductory paragraph to better establish the context for the analysis. While terms like "autophagy" and "energy homeostasis" are appropriate for a scientific audience, consider briefly defining them for clarity, especially if the intended audience might not be familiar with all the terminology. Although authors mention that there is no high-quality genome sequence for H. venatoria, it could be helpful to elaborate on why this is significant for understanding starvation resistance. A brief explanation of how genomic data could enhance understanding of the molecular mechanisms involved would strengthen this point. The conclusion provides a clear goal for your study, but it could be more impactful. You might want to emphasize the broader implications of your research findings for ecological conservation and biodiversity. End with a statement about the importance of understanding these mechanisms in the context of preserving ecosystems and addressing challenges posed by climate change.
  
  For the discussion, while the content is detailed, some parts feel slightly repetitive or could be more concise. For instance, the description of P450 gene expression could be streamlined by removing redundant mentions of their role in metabolic rate regulation. Example: In the discussion section "Interestingly, we found that some P450 families are expanded in H. venatoria, and most P450 genes are more highly expressed in the fat body than in other tissues…" This point is later reiterated in the sentence about other spider species. These ideas could be combined for efficiency. The paragraph about the phylogenetic analysis of the CYP3 clan could be shortened. While it is an interesting finding, some of the details (like the number of genes or proteins) might be better suited for the main text rather than a summary. Focusing more on the functional implications of these duplications would keep the reader engaged. Though the findings are well-explained, the broader significance could be emphasized more explicitly. For example, why is understanding these mechanisms important for the field of arachnid biology, evolutionary biology, or even practical applications (e.g., pest control, conservation)? You could add a closing sentence that ties everything together and highlights the broader relevance of the findings, such as the evolutionary or ecological importance of these adaptations in spiders.
  
  Other comments: Last paragraph of the introduction: When introducing Heteropoda venatoria, please spell out the species name the first time that is used. The sentence "However, these findings indicate that H. venatoria does not feed in a stable manner and often experiences periods of starvation." Does not fit the rest of the text. Finding from what study? Transcription design for starvation resistance in H. venatoria section: First sentence: What samples? confusing to start like this. Please add information about the samples. You could delete "the samples of H. venatoria were subjected to" it will read better. Are all 23 CYP# clan genes on chromosome 4 tandemly arrayed? Figure 4 - add more information about the figure. For pannel C, What do the red lines show? Grey? Numbers in the circles? While I know what they represent, other readers might not. The finding that H. venatoria chromosomes have undergone lots of chromosomal fragmentation is very interesting, and it is clearly shown on the figure. Which is why I think that more detail is needed. In this sentence "In Uloborus diversus, members of this subfamily are located on Chr5 and an unanchored scaffold." You need to specify which members. Figure 5 - Include a description of the tissues. What is Epi? Ducts? Tail?
2. GigaScience 03 Apr 2025
  
  in GigaScience
  
  AbstractBackground Spiders generally exhibit robust starvation resistance, with hunting spiders, represented by Heteropoda venatoria, being particularly outstanding in this regard. Given the challenges posed by climate change and habitat fragmentation, understanding how spiders adjust their physiology and behavior to adapt to the uncertainty of food resources is crucial for predicting ecosystem responses and adaptability.Results We sequenced the genome of H. venatoria and, through comparative genomic analysis, discovered significant expansions in gene families related to lipid metabolism, such as cytochrome P450 and steroid hormone biosynthesis genes. We also systematically analyzed the gene expression characteristics of H. venatoria at different starvation resistance stages and found that the fat body plays a crucial role during starvation in spiders. This study indicates that during the early stages of starvation, H. venatoria relies on glucose metabolism to meet its energy demands. In the middle stage, gene expression stabilizes, whereas in the late stage of starvation, pathways for fatty acid metabolism and protein degradation are significantly activated, and autophagy is increased, serving as a survival strategy under extreme starvation. Additionally, analysis of expanded P450 gene families revealed that H. venatoria has many duplicated CYP3 clan genes that are highly expressed in the fat body, which may help maintain a low-energy metabolic state, allowing H. venatoria to endure longer periods of starvation. We also observed that the motifs of P450 families in H. venatoria are less conserved than those in insects, which may be related to the greater polymorphism of spider genomes.Conclusions This research not only provides important genetic and transcriptomic evidence for understanding the starvation mechanisms of spiders but also offers new insights into the adaptive evolution of arthropods.
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf019), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Hui Xiang
  
  In this study, the authors deciphered the chromosome-level genome of a RTA spider Heteropoda venatoria with large body size and generated comprehensive comparative transcriptomes of fat body and whole body among CK and starvation status. Generally, this study added important genomic and transcriptomic data of spiders and provided some cues in understanding the molecular changes during starvation. However, the organization of the manuscript is quite problematic. 1. As to the Results section, please be concise and highlight the main results，avoiding accumulating complex results. Do not present too many statements in terms of introduction and discussion in Results. Do not raise too many hypotheses in the results. 2. As for the involvement of the Hippo signaling pathway in lipid metabolism regulation, the cited literature and mentioned genes are not related to the results of this study. As for the analysis of P450 results, the descriptions of structural analysis are quite complex and difficult to understand. The authors did not explain clearly the relationship between the expansion of P450 genes and hunger resistance in the results of this study. 3. The author's analyses of DEG enrichment results in transcriptome analysis is confusing. Firstly,I can't agree with the authors in that "During the early stage of starvation (from CK to 2 W), many genes, specifically those involved in oxidative phosphorylation and thermogenesis pathways, were up-regulated (Fig. 2E). These findings indicate that during the early starvation stage, energy metabolism in H. venatoria occurs regularly, with sufficient supply of energy." There are a batch of DEGs between 2W and CK, and a lot of pathways involved in neurodegeneration related pathways. How to explain these changes? Secondly, as to 4W to 8W, I can not understand the relationship of down-regulation of hippo signaling pathway to the authors' speculation that "H. venatoria may reduce its cellular glucose uptake and utilization to adjust to the food-scarce environment.", as this pathway involved in lipid metabolism, as the authors stated. Thirdly, from 14 W to 19 W, pathways such Lysosome and apoptosis were down-regulated instead of up-regulated. So how the authors thought autophagy became more active? 4. "We speculate that during the evolution of spider genomes, two types of repeat sequences, TcMar and LTR sequences, had a significant impact on the size of spider genomes. Interestingly, we found that in H. venatoria chromosomes, regions with a high proportion of repeats also presented an increase in GC content (Fig. 1B)" The author's conclusion that high repeat region has higher CG content is based on Fig1B alone, which is too arbitrary. They needs more solid evidence and more detailed analysis. For example, the GC content of TE region could be compared with that of whole genome, and the GC content of gene region. The significance of the relevant results should be explained. In addition, the author should make a more convincing discussion of this result based on the more literature. 5. "We gathered genomic data and annotations for one scorpion and seven chromosome-level spider genomes using the scorpion as an outgroup [35-42]"。Many spider genomes have been published at the chromosomal level. What were the principles behind the spider genomes the authors selected in this study? 6. "Transcriptome design for starvation resistance in H. venatoria" in Results should be partially moved Methods and here the authors should straightforwardly highlighted the results . 7. I can't understand the significance of Fig 2C. The authors did not explain it in the manuscript, either. 8. "The PCA results from both the fat body and whole-body transcriptomes indicated that H. venatoria transcriptome at 19 weeks of starvation was markedly distinct from that at other stages (Fig. 2A, B). Consequently, we conducted a differential analysis of the transcriptome at 19 weeks." Please clarify how the comparative transcriptomes were conducted for differential analysis. 9. The language should be polished.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.07.31.605936v1
Mar 2025
www.biorxiv.org www.biorxiv.org

CompactTree: A lightweight header-only C++ library for ultra-large phylogenetics

2
1. GigaScience 23 Mar 2025
  
  in GigaByte
  
  Editors Assessment:
  
  As volumes of viral and bacterial sequence data grow exponentially, the field of computational phylogenetics now demands resources to manage the burgeoning scale of this input data. This study introduces CompactTree, a C++ library designed for ultra-large phylogenetic trees with millions of tips. To address these scalability issues while maintaining ease of incorporation into external code bases, CompactTree is a header-only library with enhanced performance utilizing minimal dependencies, optimized node representation, and memory-efficient tree structure schemes. Resulting in significantly reduced memory footprints and improved processing times. Peer review requested some more detail on the functionality and some real-world examples, demonstrating the current utility of the tool. Although primarily supporting the (text-based) Newick format, the increased and extensibility scalability holds promise for multiple biological and epidemiological applications supporting more complex formats such as Nexus and NeXML. The tool is open source (GPLv3 licensed) and available in GitHub: https://niema.net/CompactTree
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 23 Mar 2025
  
  in GigaByte
  
  AbstractMotivation The study of viral and bacterial species requires the ability to load and traverse ultra-large phylogenies with tens of millions of tips, but existing tree libraries struggle to scale to these sizes.Results We introduce CompactTree, a lightweight header-only C++ library for traversing ultra-large trees that can be easily incorporated into other tools, and we show that it is orders of magnitude faster and requires orders of magnitude less memory than existing tree packages.Availability CompactTree can be accessed at: https://github.com/niemasd/CompactTreeContact niema{at}ucsd.eduSupplementary information Supplementary data are available at Bioinformatics online.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.152). These reviews (including a protocol review) are as follows.
  
  Reviewer 1. Jeet Sukumaran
  
  Is the documentation provided clear and user friendly? Yes. Excellent documentation. A pleasure to read. Are there (ideally real world) examples demonstrating use of the software? No.
  
  Reviewer 2. Ziqi Deng
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Yes. I'm able to run all the tests and used CompactTree c++ correctly except for encounter issue installation installing Python Wrapper via pip install CompactTree.
  
  Are there (ideally real world) examples demonstrating use of the software? Yes. CompactTree has provided examples of simulated trees for testing comparing to other peer packages. In the meanwhile it mentioned its ability to load the ~22M nodes greengenes2 tree. It would be great to see the test workflow so users can verify.
  
  Additional Comments: CompactTree is aimed at a very specific task, that of loading large phylogenetic trees with millions of nodes. The result shows that it is significantly faster than the other peer tools not only in loading but also in traversing trees, with less peak memory usage. It also includes the test workflow for users to repeat the test in comparison with other peer tools.
  
  Reviewer 3. Giorgio Bianchini
  
  Is the language of sufficient quality? Yes. It is slightly confusing that the paper is written using plural pronouns ("We"), when there is a single author.
  
  Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? No. The statement of need is present; however, it does not clearly explain what kinds of problems the software will be able to solve, beyond generic statements about addressing scalability issues. The aims of the library should be explored in more detail: as noted by the author, this library offers great speed and efficiency, but at the cost of reduced flexibility and functionality compared to other tools. Speed and efficiency are always good things, but what does the library actually do? A very fast library that does nothing is not particularly useful. So, what specific analyses does CompactTree allow, that would be impractical using other tools? For example, they could select a case study from the literature, where the analyses were limited by the algorithm, and use their library to extend the analysis to a larger dataset. The author mentions clustering, ancestral state reconstruction, and transmission risk prediction as examples of analyses that involve tree traversals, so they could start here (although I am not convinced that the efficiency of the tree representation is the computational bottleneck in these cases). The results should also be briefly mentioned in the abstract. Furthermore, the author mentions a number of packages used to analyse trees, but these are all Python packages. Since CompactTree is presented as a C++ library, this seems odd; other tools and programming languages should be mentioned/compared. For example, “ape” and “phytools” are very popular R packages, while “Bio++” is another C++ library; a literature review (or a simple web search) may reveal other such libraries. Also, the reference given for bp (“[4]”) is incorrect.
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Yes. Everything works fine if the header is included in a single source file, but if multiple distinct files contain the #include statement, a compilation error will occur due to the multiple definitions. In a real-world application, the library would reasonably need to be included in multiple source files, so this should be fixed.
  
  Is the documentation provided clear and user friendly? Yes. The documentation "Cookbook" is very nicely organised.
  
  Have any claims of performance been sufficiently tested and compared to other commonly-used packages? No. While the author compares CompactTree to a number of Python packages, no comparison is made against tools that use other programming languages. In particular, the author states that there is no C++ library for loading and traversing phylogenetic trees; however, as I mentioned, at least Bio++ exists and appears to be reasonably well cited. Furthermore, the memory plot does not consider the baseline memory usage. This is evident in the first two datapoints (n=100 and n=1000) for each tool, which show a very small difference, despite the leaf count increasing by an order of magnitude. If the first datapoint is subtracted from all subsequent datapoints, the memory plot looks quite similar to the other plots. If you re-run the benchmarks to include other tools, I would suggest including a “control” datapoint with a very small n (or even, loading the library without opening a tree), and subtracting this from all other datapoints; this will provide an estimate of the memory actually used to load the trees.
  
  Are there (ideally real world) examples demonstrating use of the software? No. As I mentioned above, having at least one example demonstrating an analysis that is significantly improved by the use of this library would be beneficial. Discussion of the improvements should also consider usability trade-offs in a real-world scenario.
  
  Is automated testing used or are there manual steps described so that the functionality of the software can be verified? No.
  
  Additional Comments: The library looks promising and is reasonably well documented, the only two things that are really missing are a real-world practical application and a comparison with other relevant alternatives (especially Bio++). A large portion of the manuscript is spent describing how the library could be improved, rather than what it can currently do. This could be summarised in just one or two sentences, thus leaving more space for describing the real-world example.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.07.15.603593v1
www.biorxiv.org www.biorxiv.org

Draft Genome of the Endangered Visayan Spotted Deer (Rusa alfredi), a Philippine Endemic Species

2
1. GigaScience 17 Mar 2025
  
  in GigaByte
  
  Editors Assessment:
  
  The Visayan spotted deer (Rusa alfredi), is a small, endangered, primarily nocturnal species of deer found in the rainforests of the Visayan Islands in the Philippines. The present study reports the first draft genome assembly for the species, addressing a critical gap in genomic data for this IUCN-redlisted cervid. Using Illumina sequencing, the resulting genome assembly spans 2.52 Gb in size with a BUSCO completeness score of 95.5% and encompasses 24,531 annotated genes. Phylogenetic analysis suggests a close evolutionary relationship between R. alfredi and Cervus species suggesting that the genus Rusa is sister to Cervus. Peer-review teased out more benchmarking results and the annotation files, demonstrating this genomic resource is useful and usable for advancing population genetics and evolutionary studies, thereby informing conservation strategies and enhancing breeding programs for the critically threatened species. Providing whole genome sequences for other native species of Rusa could further provide genomic resources for detecting hybrids, which will also help the management and monitoring of these species, especially for the reintroduction of captive populations in the wild.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 17 Mar 2025
  
  in GigaByte
  
  ABSTRACTThe Visayan Spotted Deer (Rusa alfredi) is an endangered and endemic species in the Philippines facing significant threats from habitat loss and hunting. It is considered as the world’s most threatened deer species by the International Union for Conservation of Nature (IUCN) thus its conservation has been a top priority. Despite its status, there is a notable lack of genomic information available for R. alfredi and the genus Rusa in general. This study presents the first draft genome assembly of the Visayan Spotted Deer (VSD), Rusa alfredi, using Illumina short-read sequencing technology. The RusAlf_1.1 assembly has a 2.52 Gb total length with a contig N50 of 46 Kb and scaffold N50 size of 75 Mb. The assembly has a BUSCO complete score of 95.5%, demonstrating the genome’s completeness, and includes the annotation of 24,531 genes. Phylogenetic analysis based on single-copy orthologs reveals a close evolutionary relationship between the R. alfredi and the genus Cervus. The availability of the RusAlf_1.1 genome assembly represents a significant advancement in our understanding of the VSD. It opens opportunities for further research in population genetics and evolutionary biology, which could contribute to more effective conservation and management strategies for this endangered species. This genomic resource can help in assuring the survival of Rusa alfredi in the country.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.150). These reviews (including a protocol review) are as follows.
  
  Reviewer 1. Endre Barta
  
  Are all data available and do they match the descriptions in the paper? No. The authors provided only the assembly in Fasta and GenBank format and the contigs (scaffolds?) in GenBank format. Neither the annotation nor the raw Illumina reads are available.
  
  Are the data and metadata consistent with relevant minimum information or reporting standards? Yes. In the cases where the data is uploaded, the provided metadata is consistent.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction? The exact parameters used during the processing are completely missing. For example, it is unclear how the RagTag-based correcting and scaffolding were carried out.
  
  Is there sufficient data validation and statistical analyses of data quality? Not my area of expertise
  
  Is the validation suitable for this type of data? No. Without having the raw Illumina reads and the exact command line parameters used, it is not possible to validate the provided results.
  
  Additional Comments:
  
  Assembling the reference genomes of endangered species is a task of immense importance, with the potential to significantly advance our understanding and conservation of these species. This work provides an initial genome assembly based on Illumina short-read sequencing. The correction and scaffolding of the contigs were made with the RagTag program using the red deer PacBio-based chromosome-level assembly. The potential benefits of this work are vast, from gaining knowledge to initiating and furthering population studies to preserve the species. According to the annotation and the BUSCO analysis, the final assembly seems especially good, considering that it is short-read based. However, there are some concerns about the methodology and the provided data. 1. The Illumina short reads and the annotation data (GFFs, VCFs) are not available. 2. The methods used are not reproducible because the descriptions of the exact parameters are missing. 3. It seems that the authors did not use the ‘-r’ parameter during the scaffolding, which resulted in inserting 100bp Ns instead of the actual size insertion based on the red deer reference genome. 4. There is no K-mer based genome size estimation. 5. The chromosome number is not known. Is there any chromosomal rearrangement between the red deer and the Visayan Spotted Deer? 6. It is not justified why the protein- and mitochondria-based trees are drawn as cladograms and not as phylograms. This way, the actual distances between the different species cannot be seen. 7. Although the short reads were mapped back to the assembly, no variation data is provided. 8. Is it necessary to include these high number (46104) short (1000>) contigs in the assembly? 9. Although the red deer assembly was used for the correction and scaffolding, the annotation was compared to the mule deer.
  
  Re-review: I thank the authors for their efforts to address the concerns raised. I broadly agree with the answers, but three further details need clarification: 1. Calculating the raw reads and the resulting genome size yields a coverage of about 62x. The authors mapped the raw reads back to the resulting reference genome sequence, which gave 47x coverage. However, both Genomescope and Merqury K-mer analysis showed 22x coverage. What is the reason for this discrepancy? 2. The K-mer analysis does indeed, and a bit strangely, show what appears to be a haploid genome. However, the 0.302% heterozygosity measured by GenomeScope is not remarkably low. To have an accurate picture of this, it would be important to count the number of heterozygous sites based on the raw reads mapped back at 47x coverage. 3. Although we do not know the exact chromosome number, fitting the reference to the red deer reference could be interesting. It would show how many scaffolds fit more than one red deer chromosome. Of course, this could be either due to chromosome rearrangement or because the contigs' scaffolding or assembly was incorrect.
  
  Reviewer 2. Haimeng Li
  
  Are all data available and do they match the descriptions in the paper?
  
  No. The genomic annotation file is not publicly available.
  
  Are the data and metadata consistent with relevant minimum information or reporting standards?
  
  No. Genomic annotation information and protein sequence information were not found in the NCBI database.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction? No.
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data? No.
  
  Additional Comments:
  
  The manuscript, 'Draft Genome of the Endangered Visayan Spotted Deer (Rusa alfredi), a Philippine Endemic Species,' contributes to the field of conservation genomics. The study presents the first draft genome assembly of the Visayan Spotted Deer, utilizing Illumina short-read sequencing technology to generate valuable genomic resources for this endangered species. Here are some questions and comments.
  
  Q1. Why was gene annotation conducted using only homology-based annotation? It is recommended that the annotation approach includes de novo, RNA-based, and homology-based methods. Combining these approaches would provide a more comprehensive gene set, particularly for species with limited genomic resources. Please revise the method section to include these additional annotation strategies. The authors have stated that due to sampling limitations, RNA-based experiments could not be conducted. RNA extraction might be performed using the tissue samples that were previously collected for genome assembly. In Lines 167-172 Q2. Before proceeding with genome assembly, it is essential to conduct a genome survey. This initial step provides crucial information about the genome's size, complexity, and composition, which is vital for planning the assembly strategy and selecting appropriate sequencing technologies and bioinformatics tools. The survey should include estimates of genome size, GC content, repetitive elements, and ploidy level. Additionally, the result could be used to assess the completeness of the assembly. Please include a section on the genome survey in the Method section. Q3. To enhance the quality and contiguity of the assembly, utilizing another species as a reference genome for scaffolding might introduce errors due to discrepancies in karyotype. It is essential to ascertain whether there is a definitive karyotype study that verifies the consistency of the karyotype between the Visayan Spotted Deer and the reference species, indicating the absence of chromosomal fission or fusion events. In Lines 236-238 This information is crucial for the reliability of the scaffolding process. Q4. Although the length of scaffold N50 is long, the high number of scaffolds and contigs suggests fragmentation. Have you addressed redundancy in the assembly? In Line 238 Q5. Have you used software like Merqury to detect assembly errors and assess the completeness of the assembly? This is useful for evaluating the quality of the genome sequence and identifying potential issues that may need to be addressed. Q6. Are the species divergent, which might explain the low number of orthologous genes? Is this an annotation issue or does it reflect true biological divergence? Further investigation into the annotation process and comparative genomic analyses may be warranted to understand the extent of divergence and the implications for the study. In Lines 313-317 Q7. Please standardize the format of numbers throughout the manuscript to maintain consistency in the number of significant figures. In Lines 224, 225, 227, 239, 245
  
  Re-review: Q1：Why is the estimated genome size from the genome survey much smaller than the assembled genome size? Q2:In the method section, I did not see a description of the de novo method for gene structure annotation. Q3:I am concerned about using a reference genome with unclear karyotype relationships for scaffolding. Q4:Are there other published comparative genomic studies on deer that have identified such a small number of homologous genes?
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.02.05.636739v1
www.biorxiv.org www.biorxiv.org

The assembly and annotation of two teinturier grapevine varieties, Dakapo and Rubired

2
1. GigaScience 17 Mar 2025
  
  in GigaByte
  
  Editors Assessment:
  
  Teinturier grapes produce berries with pigmented skin and flesh, and are used in red wine blends, as they provide a deeper colour. This paper presents the genomes of two popular teinturier varieties (Dakapo and Rubired); sequenced, assembled, and annotated to provide additional resources for their use in breeding. Combining Nanopore and Illumina sequencing for Dakapo, scaffolding to the existing grapevine assembly to generate a final assembly of 508.5 Mbp and 36,940 gene annotations. For Rubired PacBio HiFi reads were assembled, scaffolded, and phased to generate a diploid assembly with two haplotypes 474.7-476.0 Mbp long and 56,681 genes annotated. Peer review has helped validate their high quality, these genomes hopefully enabling more insight into the genetics of grapevine berry colour and their other traits like frost and mildew-resistance.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 17 Mar 2025
  
  in GigaByte
  
  ABSTRACTBackground Teinturier grapevine varieties were first described in the 16th century and have persisted due to their deep pigmentation. Unlike most other grapevine varieties, teinturier varieties produce berries with pigmented flesh due to anthocyanin production within the flesh. As a result, teinturier varieties are of interest not only for their ability to enhance the pigmentation of wine blends but also for their health benefits. Here, we assembled and annotated the Dakapo and Rubired genomes, two teinturier varieties.Findings For Dakapo, we used a combination of Nanopore sequencing, Illumina sequencing, and scaffolding to the existing grapevine genome assembly to generate a final assembly of 508.5 Mbp with an N50 scaffold length of 25.6 Mbp and a BUSCO score of 98.0%. A combination approach of de novo annotation and lifting over annotations from the existing grapevine reference genome resulted in the annotation of 36,940 genes in the Dakapo assembly. For Rubired, PacBio HiFi reads were assembled, scaffolded, and phased to generate a diploid assembly with two haplotypes 474.7-476.0 Mbp long. The diploid genome has an N50 scaffold length of 24.9 Mbp and a BUSCO score of 98.7%, and both haplotype-specific genomes are of similar quality. De novo annotation of the diploid Rubired genome yielded annotations for 56,681 genes.Conclusions The Dakapo and Rubired genome assemblies and annotations will provide genetic resources for future investigations into berry flesh pigmentation and other traits of interest in grapevine.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.149). These reviews (including a protocol review) are as follows.
  
  Reviewer 1. Camille Rustenholz
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. Overall, the authors give enough details except for the haplotypes of Chardonnay, Pinot noir, Cabernet sauvignon and Cabernet franc that were used for Figure 3.
  
  Is the validation suitable for this type of data? No. Overall, the authors provide accurate validation for this type of data except for the inversion that was identified on chromosome 10 of Dakapo assembly. In my opinion, more evidences need to be provided as Dakapo contigs were anchored using PN40024 12X.v2 assembly version. There is indeed a heterozygous region at the beginning of chromosome 10 in PN40024 genome which makes its assembly and scaffolding quality quite doubtful at that exact location and especially for this assembly version. I would suggest to check it using the latest PN40024 T2T version (Shi et al., Hort Res 2023) and to show some Dakapo short read alignments against its own assembly to validate the borders of this inversion, even though some wet lab validation would be even more convincing.
  
  Additional Comments: The authors provided the assemblies and gene annotations of the genomes of two teinturier varieties, Dakapo and Rubired. Dakapo was assembled using a combination of Nanopore and Illumina reads whereas Rubired was assembled using PacBio HiFi reads. Even though both assemblies are of high quality, quality metrics are better for Rubired assembly than for Dakapo assembly, in terms of contiguity and of phasing. I would have liked the authors to comment and explain these differences more extensively maybe in a dedicated paragraph in the Discussion section: - Why Dakapo assembly could not be phased? - Are these differences in terms of quality due to the sequencing technologies (Nanopore versus PacBio HiFi) used? Or to different year of dataset acquisition? Or to assembly methods? Both assemblies were also annotated: 36,940 genes in the Dakapo assembly and 56,681 genes in the diploid Rubired. I assume that 56,681 is the sum of the number of genes annotated on haplotype 1 and haplotype 2 of Rubired. If so, it needs to be clearly stated line 328 otherwise it can be confusing for the reader who will think that Rubired has 20,000 more genes than Dakapo. Also, the authors used two different annotation pipelines, which complicates the gene content comparison and the synteny analysis later on. I would have liked the authors to comment and explain it: - Is it due to the difference in the quality of the assemblies? If so, the authors need to highlight the limits of their annotation pipeline regarding assembly quality. - Any other explanation? Some minor suggestions : - Line 74: please use the word “clone” in the sentence for a matter of clarity. - Line 292-293: PN40024.v4 assembly is not the most recent but the PN40024 T2T is (Shi et al., Hort Res, 2023) The quality of the assemblies and annotations are very good and the resources of the paper will be very valuable for the grapevine community, especially to study the anthocyanin production in grapevine.
  
  Reviewer 2. Andrea Gschwend
  
  Are all data available and do they match the descriptions in the paper? No. The supplementary files were not made available to me for review.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  I recommend including additional details for the programs used for the Rubired genome assembly and annotation in this manuscript, though.
  
  Is there sufficient data validation and statistical analyses of data quality? No. It is unclear from the manuscript if the large Dakapo inversion was validated experimentally. See additional comments from the uploaded word document https://gigabyte-review.rivervalleytechnologies.comdownload-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvNTQ1L1JpdHRlcl9ldF9hbC5fMjAyNF9HaWdhYnl0ZV9yZXZpZXdlcl9jb21tZW50c184LTIzLTI0LmRvY3g=
  
  Reviewer 3. Yongfeng Zhou and Kekun Zhang
  
  Are all data available and do they match the descriptions in the paper? No. Is there sufficient data validation and statistical analyses of data quality? No. Is there sufficient information for others to reuse this dataset or integrate it with other data? No. Additional Comments: My main concerns: 1. Please explain why different sequencing methods were chosen for the genome assembly of Dakapo and Rubired, given that HiFi sequencing is currently mainstream and provides more accurate assembly? 2. Recently, the T2T level genome of many grape cultivars has been assembled including the reference genome PN_T2T and the teinturier grape Yan73, Please align with the latest complete reference genome PN_T2T in Line 172, and add the genome information about PN_T2T and Yan73 in Table 1. ( DOI10.1093/hr/uhad061, DOI10.1093/hr/uhad205 ) 3. Line 387-389: How did you verify the correctness of this inversion? Is it contained within a single contig without orientation or assembly errors in the Dakapo genome? Have you identified any other genomes with this inversion? 4. Line 255: can you explain why is the contig N50 so low? 5. Line 328: whether the total number of annotated genes in the two Rubired haplotypes are all 56,681? it would be more appropriate to describe them separately. 6. The phenotypes of these two grapes should be included, not just in the pattern diagram. 7. The sequence difference in Figure 2 should be verified using other methods, such as PCR results and Sanger sequencing.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.05.03.592245v2
www.biorxiv.org www.biorxiv.org

SqueezeCall: Nanopore basecalling using a Squeezeformer network

2
1. GigaScience 15 Mar 2025
  
  in GigaByte
  
  Editors Assessment:
  
  The accuracy of basecalling of nanopore sequencing still needs to be improved. With recent advances in deep learning this paper introduces SqueezeCall, a novel end-to-end tool for accurate basecalling. This uses Squeezeformer-achitecture which integrates local context extraction through convolutional layers and long-range dependency modeling via global context acquisition. Testing and peer review demonstrated that SqueezeCall outperformed traditional RNN and Transformer-based basecallers across multiple datasets, indicating its potential to refine genomic assembly and facilitate direct detection of modified bases in future genomic analytics. Future work is ongoing that will focus on training on highly curated datasets, including known modifications, to further increase research value. SqueezeCall is MIT licensed and available from GitHub here: https://github.com/labcbb/SqueezeCall
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 15 Mar 2025
  
  in GigaByte
  
  ABSTRACTNanopore sequencing, a novel third-generation sequencing technique, offers significant advantages over other sequencing approaches, owing especially to its capabilities for direct RNA sequencing, real-time analysis, and long-read length. During nanopore sequencing, the sequencer measures changes in electrical current that occur as each nucleotide passes through the nanopores. A basecaller identifies the base sequences according to the raw current measurements. However, due to variations in DNA and RNA molecules, noise from the sequencing process, and limitations in existing methodology, accurate basecalling remains a challenge. In this paper, we introduce SqueezeCall, a novel approach that uses an end-to-end Squeezeformer-based model for accurate nanopore basecalling. In SqueezeCall, convolution layers are used to down sample raw signals and to model local dependencies. A Squeezeformer network is employed to capture the global context. Finally, a connectionist temporal classification (CTC) decoder generates the DNA sequence by a beam search algorithm. Inspired by the Wav2vec2.0 model, we masked a proportion of the time steps of the convolution outputs before feeding them to the Squeezeformer network and replaced them with a trained feature vector shared between all masked time steps. Experimental results demonstrate that this method enhances our model’s ability to resist noise and allows for improved basecalling accuracy. We trained SqueezeCall using a combination of three types of loss: CTC-CRF loss, intermediate CTC-CRF loss, and KL loss. Ablation experiments show that all three types of loss contribute to basecalling accuracy. Experiments on multiple species further demonstrate the potential of the Squeezeformer-based model to improve basecalling accuracy and its superiority over a recurrent neural network (RNN)-based model and Transformer-based models.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.148). These reviews (including a protocol review) are as follows.
  
  Reviewer 1. Tao Jiang
  
  In this study, Zhongxu ZHU presents a novel approach combining the Squeezeformer architecture with masking techniques for nanopore basecalling, demonstrating meaningful improvements over existing methods. However, several concerns need to be addressed before publication. 1. The rationale behind the chosen hyperparameter values (e.g., mask_time_prob = 0.05 and mask_time_length = 5) is unclear. Did the authors experiment with other hyperparameter settings? If so, please provide results or justification for selecting these specific values. 2. The signal preprocessing methodology would benefit from a more detailed explanation. Specifically, the current description should clarify whether standard signal normalization techniques were applied to the raw current signals and detail any FFT preprocessing steps. Since nanopore sequencing signals can vary significantly between different species and experimental runs, explaining how SqueezeCall handles these variations would help other researchers implement and potentially improve upon this work. The author could consider including a flowchart or detailed pseudocode of the preprocessing pipeline. 3. A more detailed analysis of the model's error handling would strengthen the paper. Specifically, how effectively does SqueezeCall address key challenges in nanopore sequencing, such as homopolymer errors? 4. The manuscript requires attention to detail in presentation,such as: I) In Table 1, the mismatch rate (3.68) for the NA12878 Human Dataset is partially bolded, which should be corrected for consistency. II) On page 12, line 19, there is an unnecessary "e.g." before "SqueezeCall," which should be removed. 5. Instances of "Error! Reference source not found" are present in the manuscript. Please resolve these citation errors to ensure clarity and credibility.
  
  Re-review: The revised manuscript addresses most of my concerns; however, I have a few additional suggestions before recommending it for publication: 1) The newly added experimental Mask module presents only the results. Charts should be included to provide a more intuitive and visual representation of these results. 2) The images included in the Response should also be incorporated into the main text or published as supplementary materials alongside the manuscript. 3) The formulas in the manuscript are missing corresponding numbers. It is recommended to add numbers to each formula for clarity and ease of reference.
  
  Reviewer 2. Ximei Luo
  
  This manuscript describes a tool called SqueezeCall, designed for accurate nanopore basecalling. The authors compare SqueezeCall with four existing basecalling methods across 11 different datasets and report that it outperforms them in terms of basecalling accuracy. However, the study has several shortcomings and requires further clarification. Below are my comments. 1) The current discussion and conclusion section lacks sufficient analysis of the scientific and practical value of the proposed algorithm for nanopore sequencing. To strengthen the manuscript, consider expanding the conclusion section to provide a detailed discussion on the practical applications of the tool in real-world nanopore sequencing workflows. Additionally, include potential directions for further improvement of the algorithm to inspire future research and development in this area. 2) The figures in the manuscript are blurry and should be improved for clarity. Additionally, the layout requires better structuring and alignment, ensuring that the borders are neat and consistent. Efforts should be made to enhance the visual appeal of the figures, and the accompanying descriptions should provide sufficient detail to enable readers to understand the content by reviewing the figures alone. 3)To enhance the showcasing of SqueezeCall's superiority, it is advisable to include one or two of the latest methods for comparison.
  
  Minor comments: 1) There are instances of missing punctuation marks in sentences throughout the article. For example, the sentence on page 3, line 9, is missing a period at the end. 2) Address the "Reference not found" issues that appear in several places in the manuscript. 3) Number all formulas in the manuscript for easier reference and citation. 4) Verify that all references are complete and formatted according to the target journal's guidelines. 5) Some areas in Table 1 that necessitate emphasis through bold formatting are inaccurately labeled. 6) Certain content in Figure 1 and Figure 2 appears redundant; consolidation is recommended to streamline the visuals.
  
  Reviewer 3. Yongtian Wang
  
  The manuscript presents SqueezeCall, an innovative approach that combines Squeezeformer architecture with masking techniques for nanopore basecalling. The work demonstrates promising accuracy improvements through comprehensive evaluation across multiple datasets, including human, lambda phage, and nine bacterial datasets. The architecture thoughtfully integrates convolution layers for signal downsampling, employs a Squeezeformer network for capturing global context, and introduces a novel masking technique inspired by Wav2vec2.0. While the research direction and initial results are valuable, several aspects could be strengthened to enhance the work's impact: 1. Several formatting inconsistencies in the manuscript require attention for improved clarity. In Table 1, the mismatch rate (3.68) for the NA12878 Human Dataset is partially bolded, which affects the table's readability. On page 12, line 19, the redundant "e.g." before "squeezecall" should be removed. The citation system needs review as multiple instances of "Error! Reference source not found" appear throughout. 2. The mask hyperparameter selection (mask_time_prob = 0.05 and mask_time_length = 5) requires empirical justification. Including ablation studies showing model performance with different masking probabilities (e.g., 0.01, 0.03, 0.07, 0.1) and lengths (e.g., 3, 7, 10) would provide valuable insights. This analysis could reveal whether the chosen values are optimal or if there's room for improvement. A visualization of how different masking parameters affect model performance could be particularly instructive. 3. The error analysis could be expanded to provide deeper technical insights. The author should particularly analyze the distribution of skip and stay errors in homopolymer regions (e.g., AAAAA or GGGGG) where nanopore basecalling typically struggles. 4. The manuscript would benefit from exploring modified base calling capabilities. The author could train and evaluate the model on datasets containing known DNA modifications (e.g., 5mC, 6mA). This could start with synthetic sequences containing known modifications and extend to well-characterized genomic regions. Even if full modified base calling is beyond the current scope, preliminary results or architectural considerations for future extension would be valuable.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.01.21.634194v1
www.biorxiv.org www.biorxiv.org

xRead: a coverage-guided approach for scalable construction of read overlapping graph

2
1. GigaScience 03 Mar 2025
  
  in GigaScience
  
  AbstractThe development of long-read sequencing is promising to high-quality and comprehensive de novo assembly for various species around the world. However, it is still challenging for genome assemblers to well-handle thousands of genomes, tens of gigabase level genome sizes and terabase level datasets simultaneously and efficiently, which is a bottleneck to large de novo sequencing studies. A major cause is the read overlapping graph construction that state-of-the-art tools usually have to cost terabyte-level RAM space and tens of days for that of large genomes. Such lower performance and scalability are not suited to handle the numerous samples to be sequenced. Herein, we propose xRead, an iterative overlapping graph approach that achieves high performance, scalability and yield simultaneously. Under the guidance of its novel read coverage-based model, xRead uses heuristic alignment skeleton approach to implement incremental graph construction with highly controllable RAM space and faster speed. For example, it enables to process the 1.28 Tb A. mexicanum dataset with less than 64GB RAM and obviously lower time-cost. Moreover, the benchmarks on the datasets from various-sized genomes suggest that it achieves higher accuracy in overlap detection without loss of sensitivity which also guarantees the quality of the produced graphs. Overall, xRead is suited to handle numbers of datasets from large genomes, especially with limited computational resources, which may play important roles in many de novo sequencing studies.
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf007), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer #2: Anuradha Wickramarachchi
  
  Overall comments.
  
  Authors of the manuscript have developed an iterative overlap graph construction algorithm to support genome assembly. This is both an interesting and a demanding area of research due to very recent advancements in sequencing technologies.
  
  Although the text in the manuscript is interesting, grammar must be rechecked and revised. At some point it is difficult to keep track of the content and references to supplementary to make sense out of the content.
  
  Specific comments
  
  Page 1 Line 13: I believe the authors are talking about assembly sizes and not genome sizes. The sentences here could be a bit short to make them easy to understand.
  
  Page 2 Line 19: Theoretical time complexity O(m2n2) is bit of an overstatement due to the heuristics employed by most assemblers. For example, mash distance, minimisers and k-mer bins are there to prevent this explosion of complexity. Either acknowledge such methods or provide a range for the time complexity. I would be interesting to know the time complexities of the methods expressed in sentence starting Line 15.
  
  Page 5 Line 11: Was this performed with overlapping windows of 1gb? Otherwise, simulations may not have reads spanning across such regions.
  
  Page 5 Line 14: It seems you are simulating 9 + 4 + 4 datasets. This is unclear, please make this into bullet points or separate paragraphs and explain clearly. Include simulator information in the table itself by may be making it landscape (in supplementary).
  
  Fig 2: I believe authors should expand their analysis to more recent and popular assemblers. For example, wtdbg2 is designed for noisy reads and not specifically for more accurate R10/ HiFi reads. So please include, HiFi-asm, Flye where appropriate. Flye supports ONT out of the box and in my experience does produce good assemblies.
  
  Although, you are evaluating read overlaps, it is hard to ignore assemblers themselves just because they do not produce intermediate overlaps graphs.
  
  Page 5-9: In the benchmarks section, please include how True Positives and False Positives were labelled. Was this from simulation data?
  
  Page 11: Use of xRead has been evaluated on genome assemblies. This is a very important and it is a bit unfortunate that existing assemblers are not very flexible in terms of plugging in new intermediate steps. It might be worth exploring into creating a new assembler using the wtpoa2 cli command of wtdbg2.
  
  Page 16: What will happen if you only capture reads from a single chromosome due to longer length? I believe the objective is to gather longest reads capturing as much as possible covering the whole genome. Please comment on this.
  
  Page 19: In the Github Readme the download URL was wrong. Please correct it to the latest release
  
  Correct: https://github.com/tcKong47/xRead/releases/download/xRead-v1.0.0.1/xRead-v1.0.0.tar.gz Existing: https://github.com/tcKong47/xRead/releases/download/v1.0.0/xRead-v1.0.0.tar.gz
  
  Make command failed with make: *** No rule to make target main.h', needed bymain.o'. Stop.
  
  It seems the release does not have source code, but rather the compiled version. Please update github instructing how to compile code properly with a git clone.
2. GigaScience 03 Mar 2025
  
  in GigaScience
  
  AbstractThe development of long-read sequencing is promising to high-quality and comprehensive de novo assembly for various species around the world. However, it is still challenging for genome assemblers to well-handle thousands of genomes, tens of gigabase level genome sizes and terabase level datasets simultaneously and efficiently, which is a bottleneck to large de novo sequencing studies. A major cause is the read overlapping graph construction that state-of-the-art tools usually have to cost terabyte-level RAM space and tens of days for that of large genomes. Such lower performance and scalability are not suited to handle the numerous samples to be sequenced. Herein, we propose xRead, an iterative overlapping graph approach that achieves high performance, scalability and yield simultaneously. Under the guidance of its novel read coverage-based model, xRead uses heuristic alignment skeleton approach to implement incremental graph construction with highly controllable RAM space and faster speed. For example, it enables to process the 1.28 Tb A. mexicanum dataset with less than 64GB RAM and obviously lower time-cost. Moreover, the benchmarks on the datasets from various-sized genomes suggest that it achieves higher accuracy in overlap detection without loss of sensitivity which also guarantees the quality of the produced graphs. Overall, xRead is suited to handle numbers of datasets from large genomes, especially with limited computational resources, which may play important roles in many de novo sequencing studies.
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf007), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer #1: Antoine Limasset
  
  The manuscript describes xreads, a novel method that enables resource-efficient overlap graph computation based on new strategies to compute it quickly and with controlled memory usage. The authors introduce several quality metrics to assess the quality of the overlap graph and integrate their tool into NextDenovo, improving its resource usage.
  
  The manuscript is overall clear, although the section order can make it hard to read as concepts are defined backward. Some typos and minor phrasing issues should be corrected.
  
  Remarks:
  
  The manuscript spends a lot of time evaluating the quality of the overlap graph, which is a very commendable approach and is often overlooked. I thank the authors for this contribution. However, I have issues with the definition of ground truth overlap. Even if two reads do not come from successive parts of the genome, if they share, let's say, a very large perfect overlap, they should indeed overlap in the graph. Considering that the actual biological overlap is necessarily the best one found in the reads is a greedy strategy that could harm the final assembly. Because of this definition, I am not fully convinced by xreads' performance, which seems to employ an overall very greedy strategy.
  
  A key selling point of the abstract is the ability of xreads to work with controlled memory usage at the expense of time and external memory usage. Showing some results on this feature would be very interesting, such as a plot showing the time performance depending on memory usage, for example. Also, the amount of external memory used should be discussed.
  
  As far as I understand, the end goal of xreads is to perform efficient de novo assembly. The assembly results should be the primary results of the manuscript and not relegated to the supplementary section. The assembly benchmark should include other assemblers and not only NextDenovo. The assembly results and justification are not quite convincing since the proposed assembler is slightly more resource-efficient at the cost of degraded assembly quality. While the case studies are interesting, it is hard to avoid concluding that the overall quality is degraded compared to regular NextDenovo.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.05.23.541864v1
www.biorxiv.org www.biorxiv.org

Genomic evidence for hybridization and introgression between blue peafowl and green peafowl and selection for white plumage

2
1. GigaScience 03 Mar 2025
  
  in GigaScience
  
  AbstractThe blue peafowl (Pavo cristatus) and the green peafowl (Pavo muticus) have significant public affection due to their stunning appearance, although the green peafowl is currently endangered. Some studies have suggested introgression between these the two species, although evidence is mixed. In this study, we successfully assembled a high-quality chromosome-level reference genome of the blue peafowl, including the autosomes, Z and W sex chromosomes as well as a complete mitochondria DNA sequence. Data from 77 peafowl whole genomes, 76 peafowl mitochondrial genomes and 33 peahen W chromosomes genomes provide the first substantial genetic evidence for recent hybridization between green and blue peafowl. We found three hybrid green peafowls in zoo samples rather than in the wild samples, with blue peafowl genomic content of 16-34%. Maternal genetic analysis showed that two of the hybrid female green peafowls contained complete blue peafowl mitochondrial genomes and W chromosomes. Hybridization of endangered species with its relatives is extremely detrimental to conservation. Some animal protection agencies release captive green peafowls in order to maintain the wild population of green peafowls. Therefore, in order to better protect the endangered green peafowl, we suggest that purebred identification must be carried out before releasing green peafowls from zoos into the wild in order to preventing the hybrid green peafowl from contaminating the wild green peafowl. In addition, we also found that there were historical introgression events of green peafowl to blue peafowl in four Zoo blue peafowl individuals. The introgressed genomic regions contain IGFBP1 and IGFBP2 genes that could affect blue peafowl body size. Finally, we identified that the nonsense mutation (g.4:12583552G>A) in the EDNRB2 gene is the genetic causative mutation for white feather color of blue peafowl (also called white peafowl), which prevents melanocytes from being transported into feathers, such that melanin cannot be deposited.
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giae124), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer #2: Subhradip Karmakar
  
  I read with interest the manuscript " Genomic evidence for hybridization and introgression between blue peafowl and endangered green peafowl and molecular Foundation of peafowl white plumage" by Lujiang et al. . This is a well-drafted, well-executed study that investigated the effect of introgression in shaping the genomic diversity landscape of peafowl. I am glad the authors undertook this much-needed study which is so critical from an evolutionary point of view. I have few queries and clarifications needed : 1. Fig S21 : Manhattan Plot : What is the loci on Chr 4 & Chr 6 that showed above threshold? What are the consequences of IL12b and IL25 ? 2. Page 50, Line : 929 : " The genes (IGF2BP3, TGBR1, ISPD, MEOX2, GLI3 and MC4R) related to body size in blue peafowl were also found to have introgression areas from green peafowl" What is the evidence for this ? Were these genes absent before the introgression events in blue peafowl? What are the modifications of IGFBP after introgression? Is it under positive selection? If yes why 3. There is not much discussion on Fig S 22 ( Suppl) on the KEGG Pathway hits. What is the significance of ribosome biogenesis? Protein processing in ER, etc 4. The white peafowls were homozygous for the mutant (A/A), resulting in the loss of EDNRB2 transcript. What is the reason for this mutant gene's fixation in white plumage birds? 5. The images, almost all of them, appear very hazy and blurry. It may be an issue with my computer. Please recheck 6. Please elaborate on the significance of IL6 and other immune-related genes in the discussion.
2. GigaScience 03 Mar 2025
  
  in GigaScience
  
  AbstractThe blue peafowl (Pavo cristatus) and the green peafowl (Pavo muticus) have significant public affection due to their stunning appearance, although the green peafowl is currently endangered. Some studies have suggested introgression between these the two species, although evidence is mixed. In this study, we successfully assembled a high-quality chromosome-level reference genome of the blue peafowl, including the autosomes, Z and W sex chromosomes as well as a complete mitochondria DNA sequence. Data from 77 peafowl whole genomes, 76 peafowl mitochondrial genomes and 33 peahen W chromosomes genomes provide the first substantial genetic evidence for recent hybridization between green and blue peafowl. We found three hybrid green peafowls in zoo samples rather than in the wild samples, with blue peafowl genomic content of 16-34%. Maternal genetic analysis showed that two of the hybrid female green peafowls contained complete blue peafowl mitochondrial genomes and W chromosomes. Hybridization of endangered species with its relatives is extremely detrimental to conservation. Some animal protection agencies release captive green peafowls in order to maintain the wild population of green peafowls. Therefore, in order to better protect the endangered green peafowl, we suggest that purebred identification must be carried out before releasing green peafowls from zoos into the wild in order to preventing the hybrid green peafowl from contaminating the wild green peafowl. In addition, we also found that there were historical introgression events of green peafowl to blue peafowl in four Zoo blue peafowl individuals. The introgressed genomic regions contain IGFBP1 and IGFBP2 genes that could affect blue peafowl body size. Finally, we identified that the nonsense mutation (g.4:12583552G>A) in the EDNRB2 gene is the genetic causative mutation for white feather color of blue peafowl (also called white peafowl), which prevents melanocytes from being transported into feathers, such that melanin cannot be deposited.
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giae124), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer #1: Huirong Mao
  
  The authors had finished very systematic and comprehensive research. They obtained a high-quality chromosome-level reference genome of the blue peafowl, including the autosomes, Z and W sex chromosomes as well as a complete mitochondria DNA sequence by combined several sequencing technologies（HiFi sequencing and Hi-C sequencing）. Based on this, they further confirmed the evidence of introgression between blue peafowl and green peafowl. In addition, it is finding the nonsense mutation (g.4:12583552G>A) in the EDNRB2 gene as the causative mutation for white feather color of blue peafowl that identifies an important gap on the genetic mechanism of the white plumage in the peafowl. Overall, The results and resources obtained from this study are valuable further comparative genomic studies in birds. The analyses are also sound and comprehensive. However, before considering acceptance, there are some questions and clarifications needed from the authors to fully substantiate the findings and their implications. i) The "Results" section of the paper contains extensive analysis and discussion, which overlaps significantly with the "Discussion" section. It is recommended to consolidate and streamline these sections. ii) The authors used 'white feather' peafowl throughout the manuscript. Actually there are scientific terms about these color abnormality, for instance, leucism or albino plumage. Please define whether your samples from leucitic or albino populations. Also please change the term 'white feather' throughout the manuscript. iii) The authors used three types of data (one-to-one orthologs datasets, four-fold degenerate sites datasets and mitochondrial sequence datasets) to study the genetic relationships between peacocks, chickens, and turkeys, and proved that the genetic distance between peacocks and chickens is closer (See Line 859-862). However, from the results section, in Figure 1C, the pattern of tree3 shows that the genetic distance between peacocks and turkeys appears to be closer, suggesting a certain contradiction between the results and the discussion sections. iv) Why were individuals with the "pied" phenotype not selected as controls for the corresponding transcriptomic study to validate the molecular mechanisms of feather formation in blue peacocks using RNA-Seq results? v) The statement in the sentence "Compared with the peafowl, the ROH length of all peafowl populations is short and the total is small (see Line 624-625)" seems to be incorrect. vi) The entire paper still needs further improvement in terms of writing norms and grammar. （eg. Line 642, "as an outgroup", Line 647 "The mitochondrial phylogenetic" etc ）
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.12.27.573425v1
www.biorxiv.org www.biorxiv.org

TransHLA: A Hybrid Transformer Model for HLA-Presented Epitope Detection

2
1. GigaScience 03 Mar 2025
  
  in GigaScience
  
  AbstractBackground Precise prediction of epitope presentation on human leukocyte antigen (HLA) molecules is crucial for advancing vaccine development and immunotherapy. Conventional HLA-peptide binding affinity prediction tools often focus on specific alleles and lack a universal approach for comprehensive HLA site analysis. This limitation hinders efficient filtering of invalid peptide segments.Results We introduce TransHLA, a pioneering tool designed for epitope prediction across all HLA alleles, integrating Transformer and Residue CNN architectures. TransHLA utilizes the ESM2 large language model for sequence and structure embeddings, achieving high predictive accuracy. For HLA class I, it reaches an accuracy of 84.72% and an AUC of 91.95% on IEDB test data. For HLA class II, it achieves 79.94% accuracy and an AUC of 88.14%. Our case studies using datasets like CEDAR and VDJdb demonstrate that TransHLA surpasses existing models in specificity and sensitivity for identifying immunogenic epitopes and neoepitopes.Conclusions TransHLA significantly enhances vaccine design and immunotherapy by efficiently identifying broadly reactive peptides. Our resources, including data and code, are publicly accessible at https://github.com/SkywalkerLuke/TransHLA
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf008), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Markus Müller
  
  The authors present TransHLA, a deep learning tool to predict whether a peptide is an HLA binder or not. They use the ESM2 language model to create peptide embeddings for structural and sequence features and then use transformers and CNNs for the binding prediction. The article is well-written and clear. However, the authors must better justify the choice of their model and its potential application.
  
  Major comments:
  
  1) In personalized medicine, the HLA alleles of a patient can be obtained via WES and there is no need for such a HLA agnostic binding predictor. Could you briefly outline the most important medical applications where your TransHLA predictor could be most useful?
  
  2) Could you give more information about your IEDB training set? What are the frequencies of the HLA alleles, and the number of peptides per allele? How did you perform the splits into training, validation, and test sets? Were peptides from the same allele all present in all 3 sets? How does TransHLA perform for peptides binding to alleles not present in the training set compared to peptides binding to alleles present in the training set? How does the performance depend on the number of peptides of the allele in the training set? Is the model biased to these frequent alleles?
  
  3) Peptides are processed by many steps before being presented on HLA molecules. These include cleavage in the proteasome, transport via TAP to the ER, cleavage by ERADs and finally loading on the HLA complex. Why don't you perform your study on extended peptide sequences, where you take into account several amino acids before and after the peptide termini? Like this, you could also include the other processing steps. It would be interesting to see whether this sequence extension would improve prediction.
  
  4) Could you compare your approach with a 'simpler' approach, where you calculate all biopython features (such as flexibility), ev. choose the n most informative ones by feature selection, and use a standard classifier such as logistic regression or XGBoost to predict the HLA binding. This method has the advantage that it tells you directly which features are most relevant.
  
  5) Please provide the results of the ablation study in a table in the main text, where you compare the ablated models to the base model.
  
  6) Could you briefly explain what the different terms in the TIM loss are and why they are important?
  
  7) Does the flexibility depend on the length of the peptides? Peptides longer than 10 often bulge out of the binding groove, and naively one would expect them to be less stiff than peptides of length 8 or 9.
  
  Minor:
  
  1) In Equation 10, please define ^p_k. In the text, you use T for the number of classes, in the formulae K.
2. GigaScience 03 Mar 2025
  
  in GigaScience
  
  TransHLA: A Hybrid Transformer Model for HLA-Presented Epitope Detection
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf008), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  **Reviewer 1: Georgios Fotakis **
  
  1) General Comments In this manuscript, the authors present TransHLA, a hybrid transformer model that integrates a transformer-based language model with a deep Convolutional Neural Network (CNN) module. The transformer encoder module leverages a pre-trained large language model (Evolutionary Scale Modeling - ESM2) to extract global features using a multi-head attention mechanism. The feature extraction is further enhanced by two consecutive CNN modules, maximizing the mutual information between query features (sequences) and their label predictions (epitope/non-epitope) through a modified Transductive Information Maximization (TIM) loss function. TransHLA is designed to collectively consider all HLA sites across all alleles and is the first neoantigen prediction tool of its kind, since it does not require HLA alleles as input. The authors also present benchmark study results, showcasing the increased predictive accuracy of TransHLA and its potential as a valuable pre-screening tool.
  
  The computational method presented in this manuscript demonstrates a strong scientific foundation and shows promise for future refinement and extension, suggesting significant potential for meaningful research output. However, there are some conceptual and technical concerns that need to be addressed.
  
  2) Specific comments for revision a) Major Manuscript: i) Introduction - The authors distinguish between two categories of models: those that need only epitopes as input and those that require both epitopes and HLA alleles as inputs. However, the basis for this classification is unclear. For instance, MHCNuggets and DeepSeqPanII, cited as examples of the first category, actually require both an allele and an epitope to predict neoantigens. This is supported by the algorithms' manuals and the supplementary material provided by the authors, where they specify the need for HLA alleles to execute the commands.
  
  The authors state: "Considering that TransHLA is the first epitope prediction software that does not impose restrictions on HLA alleles" This needs clarification, as all available "pan-allele" models do not impose restrictions on HLA alleles (the models are trained on nearly all sequenced HLAs). Perhaps the authors meant that TransHLA does not require HLA alleles as input?
  
  ii) Results - The reason for conducting two separate benchmarks (case study and validation) with different HLA binding affinity predictors is unclear. For instance, it is not explained why netMHCpan/netMHCpanII were not included in the first benchmark and only used in the validation part.
  
  It would be very informative if the authors were able to include other widely used HLA binding affinity predictors in their benchmarks, such as mixMHCpred and mixMHCpred2.
  
  The authors state: "the details information of alleles used in each tool can be found in the Supplementary File" However, no information about the alleles used in this study is provided (or at least it was not made available to me at the time of reviewing this version of the manuscript).
  
  The "protein structural flexibility" should be briefly explained and properly cited (Vihinen et al., 1994, Proteins, 19(2), 141-149).
  
  iii) Conclusion and Discussion - The authors claim that TransHLA alleviates "the restrictive requirement of knowing the specific HLA alleles." However, this is not typically a restriction, as serological typing of HLA is routinely performed in clinics, and samples usually come with relevant metadata. Additionally, HLA typing can be easily performed with RNAseq and/or WES data, the same data usually required to produce the putative epitopes initially, with high accuracy (e.g., OptiType can reach 93.5% [CI95: 91.8-95.1%] accuracy for HLA class I). Therefore, this information is generally readily available for processing. While the authors effectively demonstrate the accuracy of TransHLA, they fail to clarify the context in which this computational tool could be utilized.
  
  To the best of my knowledge, in the research field of personalized medicine, neoantigen vaccines are typically produced at the patient level, taking the patients' HLA alleles into consideration. Binding affinity, by definition, can quantitatively differentiate between strong (low IC50) and weak (high IC50) binders. Thus, binding affinity predictions are a pivotal step for neoantigen prioritization. Given that the authors suggest TransHLA as an "alternative for filtering potential epitopes", how would TransHLA perform in such situations? To enhance clarity, the authors should elaborate on a scenario where TransHLA would be a superior choice compared to high-performing HLA binding affinity predictors in this context.
  
  The authors mention in the introduction that TransHLA can be used to "expedite the precise screening of peptides". Additionally, in their GitHub repository it is stated that TransHLA "can serve as a preliminary screening for the currently popular tools that are specific for HLA-epitope binding affinity", which is quite accurate. They might consider incorporating this concept into their concluding remarks as well.
  
  Implementation: - Since neoantigen prediction is typically carried out using computational pipelines, it would be very helpful if the authors could provide instructions for end-users to install the software and its dependencies in isolated (contained) computational environments. To enhance clarity, I am attaching the files I used to create these environments via Conda (transhla_env.yaml), Singularity (TransHLA.def), and Docker (Dockerfile).
  
  Following the previous point, the authors should consider providing a CLI (similar to the "train.py" and "inference.py" scripts in their GitHub repository) to enhance the software's usability in computational pipelines. As an example, I am attaching the script I used to test the software (TransHLA.py).
  
  b) Minor - It would enhance the clarity (especially for readers who are not familiar with artificial intelligence) if the authors would briefly explain each technical term and then use the abbreviations. For example, "Evolutionary Scale Modeling (ESM2)" and so on.
  
  Additionally, the manuscript and its supplementary material contain several grammatical and spelling errors that need to be rectified.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.01.20.634002v1
Feb 2025
www.biorxiv.org www.biorxiv.org

A chromosome-level, haplotype-resolved genome assembly and annotation for the Eurasian minnow (Leuciscidae - Phoxinus phoxinus) provide evidence of haplotype diversity

2
1. GigaScience 10 Feb 2025
  
  in GigaScience
  
  In this study we present an in-depth analysis of the Eurasian Minnow (Phoxinus phoxinus) genome, highlighting its genetic diversity, structural variations, and evolutionary adaptations. We generated an annotated haplotype-phased, chromosome-level genome assembly (2n = 25) by integrating high-fidelity (HiFi) long reads and chromosome conformation capture data (Hi-C). We achieved a haploid length of 940 Mbp for haplome one and 929 Mbp for haplome two with high N50 values of 36.4 Mb and 36.6 Mb and BUSCO scores of 96.9% and 97.2%, indicating a highly complete genome.We detected notable heterozygosity (1.43%) and a high repeat content (approximately 54%), primarily consisting of DNA transposons, which contribute to genome rearrangements and variations. We found substantial structural variations within the genome, including insertions, deletions, inversions, and translocations. These variations affect genes enriched in functions such as dephosphorylation, developmental pigmentation, phagocytosis, immunity, and stress response.Protein annotation identified 30,980 mRNAs and 23,497 protein-coding genes with a high completeness score, providing further support for our genome’s high contiguity. We performed a gene family evolution analysis by comparing our proteome to ten other teleost species, which identified immune system gene families that prioritise histone-based disease prevention over NLR-based immune responses.Additionally, demographic analysis indicates historical fluctuations in the effective population size of P. phoxinus, likely correlating with past climatic changes.This annotated, phased reference genome provides a crucial resource for resolving the taxonomic complexity within the genus Phoxinus and highlights the importance of haplotype-phased assemblies in understanding haplotype diversity in species characterised by high heterozygosity.Competing Interest StatementThe authors have declared no competing interest.
  
  Reviewer 2. Alice Dennis
  
  I previously reviewed this paper previously for Peer-Community-In-Genomics and you can read these comments via the PCI-review page here: https://genomics.peercommunityin.org/articles/rec?id=333]
  
  I actually did three rounds form PCI and was more than happy with the result. I'm attaching them all here in case they didn't all make it to you.
  
  The original preprint linked to the PCI-review is here: https://doi.org/10.1101/2023.11.30.569369.
  
  I have no other concerns on the manuscript. Glad to see it published on GigaScience.
2. GigaScience 10 Feb 2025
  
  in GigaScience
  
  AbstractIn this study we present an in-depth analysis of the Eurasian Minnow (Phoxinus phoxinus) genome, highlighting its genetic diversity, structural variations, and evolutionary adaptations. We generated an annotated haplotype-phased, chromosome-level genome assembly (2n = 25) by integrating high-fidelity (HiFi) long reads and chromosome conformation capture data (Hi-C). We achieved a haploid length of 940 Mbp for haplome one and 929 Mbp for haplome two with high N50 values of 36.4 Mb and 36.6 Mb and BUSCO scores of 96.9% and 97.2%, indicating a highly complete genome.We detected notable heterozygosity (1.43%) and a high repeat content (approximately 54%), primarily consisting of DNA transposons, which contribute to genome rearrangements and variations. We found substantial structural variations within the genome, including insertions, deletions, inversions, and translocations. These variations affect genes enriched in functions such as dephosphorylation, developmental pigmentation, phagocytosis, immunity, and stress response.Protein annotation identified 30,980 mRNAs and 23,497 protein-coding genes with a high completeness score, providing further support for our genome’s high contiguity. We performed a gene family evolution analysis by comparing our proteome to ten other teleost species, which identified immune system gene families that prioritise histone-based disease prevention over NLR-based immune responses.Additionally, demographic analysis indicates historical fluctuations in the effective population size of P. phoxinus, likely correlating with past climatic changes.This annotated, phased reference genome provides a crucial resource for resolving the taxonomic complexity within the genus Phoxinus and highlights the importance of haplotype-phased assemblies in understanding haplotype diversity in species characterised by high heterozygosity.
  
  After initial review in PCI-Genomics (see https://genomics.peercommunityin.org/articles/rec?id=333), a version of this preprint has now been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae116), where the paper and peer reviews are published openly under a CC-BY 4.0 license. The PCI-Genomics reviewers were consulted if they had any additional comments and these were as follows.
  
  Reviewer 1: Henrik Lantz
  
  I previously reviewed this paper previously for Peer-Community-In Genomics and you can read these comments via the PCI-review page here:
  
  https://genomics.peercommunityin.org/articles/rec?id=333
  
  The original preprint linked to the PCI-reviews is here:
  
  https://doi.org/10.1101/2023.11.30.569369.
  
  I am satisfied with the latest version of manuscript.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.11.30.569369v2
www.biorxiv.org www.biorxiv.org

Knowledge Graph-based Thought: a knowledge graph enhanced LLMs framework for pan-cancer question answering

2
1. GigaScience 04 Feb 2025
  
  in GigaScience
  
  Background In recent years, Large Language Models (LLMs) have shown promise in various domains, notably in biomedical sciences. However, their real-world application is often limited by issues like erroneous outputs and hallucinatory responses.Results We developed the Knowledge Graph-based Thought (KGT) framework, an innovative solution that integrates LLMs with Knowledge Graphs (KGs) to improve their initial responses by utilizing verifiable information from KGs, thus significantly reducing factual errors in reasoning. The KGT framework demonstrates strong adaptability and performs well across various open-source LLMs. Notably, KGT can facilitate the discovery of new uses for existing drugs through potential drug-cancer associations, and can assist in predicting resistance by analyzing relevant biomarkers and genetic mechanisms. To evaluate the Knowledge Graph Question Answering (KGQA) task within biomedicine, we utilize a pan-cancer knowledge graph to develop a pan-cancer question answering benchmark, named the Pan-cancer Question Answering (PcQA).Conclusions The KGT framework substantially improves the accuracy and utility of LLMs in the biomedical field. This study serves as a proof-of-concept, demonstrating its exceptional performance in biomedical question answering
  
  This work has been peer reviewed in GigaScience (see , https://doi.org/10.1093/gigascience/giae082), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: Cody Bumgardner
  
  We are just beginning to get a glimpse into the ways that large language models (LLMs) might advance biomedical informatics. The framework you have described I would consider a serious contribution to the state-of-the-art in the area of bridging LLMs and structured data. The use of LLMs for code generation and interpretation within the same request is also innovative. The application of your framework to MeSH (https://www.nlm.nih.gov/mesh/meshhome.html) and other broader linked ontologies would be very interesting. You might also consider integrating tool calling as well (which in a way you are with subgraphs), to either further reduce the demential space or accessing data that does not otherwise have a graph structure. In this case, the content of your subgraph nodes might be the result of a function call. Congratulations on your work, it is a real contribution to our community.
2. GigaScience 04 Feb 2025
  
  in GigaScience
  
  Background In recent years, Large Language Models (LLMs) have shown promise in various domains, notably in biomedical sciences. However, their real-world application is often limited by issues like erroneous outputs and hallucinatory responses.Results We developed the Knowledge Graph-based Thought (KGT) framework, an innovative solution that integrates LLMs with Knowledge Graphs (KGs) to improve their initial responses by utilizing verifiable information from KGs, thus significantly reducing factual errors in reasoning. The KGT framework demonstrates strong adaptability and performs well across various open-source LLMs. Notably, KGT can facilitate the discovery of new uses for existing drugs through potential drug-cancer associations, and can assist in predicting resistance by analyzing relevant biomarkers and genetic mechanisms. To evaluate the Knowledge Graph Question Answering (KGQA) task within biomedicine, we utilize a pan-cancer knowledge graph to develop a pan-cancer question answering benchmark, named the Pan-cancer Question Answering (PcQA).Conclusions The KGT framework substantially improves the accuracy and utility of LLMs in the biomedical field. This study serves as a proof-of-concept, demonstrating its exceptional performance in biomedical question answering.
  
  This work has been peer reviewed in GigaScience (see , https://doi.org/10.1093/gigascience/giae082), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: Linhao Luo
  
  Summary: This paper proposes a novel framework called KGT that integrates Large Language Models (LLMs) with Knowledge Graphs (KGs) for pan-cancer question answering. The KGT framework can effectively retrieve knowledge from KGs and improve the accuracy of LLMs for question answering. Moreover, it can provide interpretable and faithful explanations with the help of structured KGs. Comments: 1. This paper construct a new dataset denoted as PcQA form a customized KG called SOKG for the evaluation of pan-cancer question answering. This is a great contribution to the community. However, it is unclear how to constuct such a dataset. More details about the construnction process and statistics of the final datasets should be disscussed in the paper. For example, how to generate the natural language questions and answers? How to link the question with relatived KG information (i.e., entity and relation)? How many questions can be answered by the KGs (i.e., answer converage rate). How many questions have been generated? What is the ratio of each quetion types defined in Table 2? 2. In Table2, the author define 4 reasoning types. How about other reasoning types such union and negation? Can we incorpate these tpes into the datasets? 3. The propsed method is novel and interesting. However some details are unclear. In the candidate path search, do we want to search reasoning paths or relational chains? The definition of these two paths are also unclear. Please give clear definition of them in prelimary. If is the reasoning paths, do we only keep the type information during the BFS? 4. I do not understand why we need to generatea cypher query to retrieve subgraph then construct relation paths from KG. We can directly retrieved relational paths from KGs by BFS. What are the benefits and motivations of using this two-stage pipeline? 5. What are the meanings of the X and âˆš in the figure. How to get them? 6. In experiments, other advanced KGQA methods can be compared, e.g., RoG [1] and ToG [2]. 7. The analysis of used token, time, and cost should be disscussed in the paper. 8. Can we apply the proposed metod to other KGs (i.e., SynLethKG, and SDKG) or KGQA tasks (MetaQA, and FACTKG) to show the generability. [1] LUO, L., Li, Y. F., Haf, R., & Pan, S. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning. In The Twelfth International Conference on Learning Representations. [2] Sun, J., Xu, C., Tang, L., Wang, S., Lin, C., Gong, Y., ... & Guo, J. (2023). Think-on-graph: Deep and responsible reasoning of large language model with knowledge graph. arXiv preprint arXiv:2307.07697
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.04.17.589873v2
www.biorxiv.org www.biorxiv.org

A practical DNA data storage using expanded alphabet introducing 5-methylcytosine

2
1. GigaScience 04 Feb 2025
  
  in GigaByte
  
  Editors Assessment:
  
  DNA has huge potential as a data storage medium because of its incredibly high storage density and stability. This work addresses the potential of modified bases, specifically 5-methylcytosine (5mC), in enhancing DNA data storage systems. This paper introduces a transcoding scheme named R+, which incorporates this modified 5mC base to increase information density beyond the standard limits. By encoding various file types into DNA sequences of between 1.3 to 1.6 kb in size, this method achieves an average recovery rate of 98.97% (with reference), validating the effectiveness of the method. On top of a wet-lab protocol (hosted in protocols.io) for the experimental validation of the transcoding scheme, it also includes open source code for in-silico simulation tests. Peer review scruitinising the protocols and validation are reusable and provide convincing results. As nanopore sequencing has enabled reading of these modified bases, it is timely making them applicable as extra letters in the molecular alphabet for DNA data storage
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 04 Feb 2025
  
  in GigaByte
  
  AbstractDNA molecular is a promising next-generation data storage medium. Recently, it has been theoretically proposed that non-natural or modified bases can serve as extra molecular letters to increase the information density. However, the feasibility of the strategy is challenging due to the difficulty in synthesizing and the complex structure of non-natural DNA sequences. Here, we described a practical DNA data storage transcoding scheme named R+ based on expanded molecular alphabet by introducing 5-methlcytosine(5mC). We also demonstrated the experimental validation by encoding one representative file into several 1.3~1.6 kbps in vitro DNA fragments for nanopore sequencing. The results show an average data recovery rate of 98.97% and 86.91% with and without reference respectively. This work validates the practicability of 5mC in DNA storage systems, with a potentially wide range of applications.Availability & Implementation R+ is implemented in Python and the code is available under the MIT license at https://github.com/Incpink-Liu/DNA-storage-R_plus
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.147). These reviews (including a protocol review) are as follows.
  
  Reviewer 1. Abdur Rasool
  
  Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code? However, the Git links have a typo; the working code is available at https://github.com/Incpink-Liu/DNA-storage-R_plus
  
  Is the code executable?
  
  Unable to test. Complete execution of the given code requires time and resources.
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Unable to test. Additional Comments: This manuscript focuses on DNA data storage based on an expanded molecular alphabet. In view of the challenges of non-natural bases in synthesis, sequencing, and compatibility, the manuscript proposes a DNA data storage scheme containing 5-methylcytosine based on the theory that modified bases can replace non-natural bases as extra molecular letters and develops an adaptive transcoding algorithm named R+ for corresponding experimental validation. The high data recovery rate obtained from sequencing analysis demonstrates its practicability.
  
  This manuscript provides a simple but relatively universal transcoding algorithm for DNA data storage that introduces additional molecular letters. The proposed DNA data storage scheme outperforms conventional DNA data storage in the potential development of information density. Considering the anticipated decrease in future synthesis costs and the expected advancements in relevant transcoding algorithms, my outlook remains optimistic regarding the potential application of this scheme. I suggest that the manuscript could be accepted after a few minor revisions listed below:
  
  Figure 3 in the paper could be further modified, specifically minimizing the excess white space on both sides of Subfigure A to make it more aesthetically pleasing.
  
  The subfigures A, B, and D in Figure 2 and Figure S2 both demonstrate the difference between poem.txt/program.py and the other four files. However, the manuscript lacks an explanation for this phenomenon. Is it relevant to the file size?
  
  The 8 nt adaptors play a key role during the sequence assembly in the experimental validation, so I suggest supplementing the specific generation process of these linkers. Text descriptions or flow charts are acceptable.
  
  It’s better to add the silico simulation to the Methods to make its structure more complete.
  
  For the practicality of DNA storage, I suggest to cite https://onlinelibrary.wiley.com/doi/10.1002/smtd.202301585 and https://academic.oup.com/bib/article/25/5/bbae463/7759103.
  
  Provide the correct URLs of GitHub links for reproducibility.
  
  Reviewer 2. Bi Kun
  
  Are there (ideally real world) examples demonstrating use of the software?
  
  No. Additional Comments:
  
  In this study, a practical DNA data storage transcoding scheme named R+ based on expanded molecular alphabet is proposed to increase the information density. The experimental validation demonstrates the practicability of DDS-5mC and highlight the enormous potential of modified bases represented by 5mC in the field of DNA data storage. Overall, the methods and results look appropriate and promising, but it has minor issues that need to be addressed currently.
  
  1.Please indicate the proportion of substitution: insertion: deletion in the error rates of Fig. 4C and D. 2.What is the meaning of the vertical axis of Fig. 2B? Is it the number of homopolymers per sequence, the longest length of homopolymers, or something else? 3.Line 304, please add s, "References" 4.The last sentence of the Abstract: "This work validates the practicability of 5mC over other non-natural bases in DNA storage systems". Please correspond it with the last paragraph of Results (151-154). 5.If necessary, according to the guideline of this journal, section Conclusion can be added or not.
  
  Reviewer 3. Lifu Song
  
  This manuscript explores the application of 5-methylcytosine (5mC) as an additional molecular letter in DNA data storage systems, expanding the molecular alphabet to increase information density. The authors present a novel transcoding scheme (R+) and validate it with both in silico and experimental data. The study explores GC content, homopolymer distribution, and data recovery rates under various conditions, offering detailed insights into practical applications. Experimental validation with nanopore sequencing demonstrates real-world feasibility. By improving storage density and ensuring compatibility with nanopore sequencing, the study addresses significant challenges in incorporating non-natural bases into DNA storage systems. Overall, the manuscript is well-structured and addresses a highly relevant topic in DNA data storage, offering valuable contributions to the field. I recommend it for publication, subject to minor revisions to enhance clarity and precision.
  
  Suggested minor revisions: 1) Although substitution errors, particularly between C and 5mC, were discussed, the manuscript does not provide a detailed explanation of how these errors affect long- term storage or large-scale applications—both of which are critical for archival storage, the primary use case of DNA data storage technology. 2) The manuscript could benefit from a broader comparison with other high-density DNA storage strategies, such as composite DNA letters, to contextualize the benefits and limitations of 5mC. 3) The discussion could be expanded to address practical challenges, such as strategies to reduce synthesis costs and improve sequencing accuracy for modified bases like 5mC, to provide a more holistic perspective on the technology's scalability.
  
  Protocol Review: I have taken a look at the experiment protocol associated with this manuscript in the website of protocols.io. The protocol looks sensible. I don't have any additional comments about it and am happy for it to go live.
  
  See: https://dx.doi.org/10.17504/protocols.io.q26g7mr78gwz/v1
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.12.26.630439v1
Jan 2025
www.biorxiv.org www.biorxiv.org

Polyploid genome assembly of Cardamine chenopodiifolia

2
1. GigaScience 08 Jan 2025
  
  in GigaByte
  
  This evaluation refers to version 1 of the preprint
  
  This work presents the genome of Cardamine chenopodiifolia, an amphicarpic plant (developing two fruit types, one above and another below ground) in the mustard (Brassicaceae) family. Cardamines also known as bittercresses and toothworts. As an octoploid species it has been challenging to create a genome reference for this species, and in this case the authors finally managed to achieve this using PacBio HiFi long-reads and Omni-C technology to assemble a fully phased, chromosome-level genome. Obtaining a 597Mb genome assembled into 32 phased chromosomes (plus mitochondrial and plastid genomes), and only having one gap in the centromeric region of chromosome 9. Peer review asked for additional QC and benchmarking, helping demonstrate the genome quality was very high, with only one gap and a N50 of 18.80Mb. The data presented here potentially helping to develop this species as an emerging model organism in the Brassicaceae for studying the development and evolution of amphicarpy by allopolyploidy.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 08 Jan 2025
  
  in GigaByte
  
  AbstractBackground Cardamine chenopodiifolia is an amphicarpic plant that develops two fruit morphs, one above and the other below ground. Above-ground fruit disperse their seeds by explosive coiling of the fruit valves, while below-ground fruit are non-explosive. Amphicarpy is a rare trait that is associated with polyploidy in C. chenopodiifolia. Studies into the development and evolution of this trait are currently limited by the absence of genomic data for C. chenopodiifolia.Results We produced a chromosome-scale assembly of the octoploid C. chenopodiifolia genome using high-fidelity long read sequencing with the Pacific Biosciences platform. We successfully assembled 32 chromosomes and two organelle genomes with a total length of 597.2 Mbp and an N50 of 18.8 kbp (estimated genome size from flow cytometry: 626 Mbp). We assessed the quality of this assembly using genome-wide chromosome conformation capture (Omni-C) and BUSCO analysis (97.1% genome completeness). Additionally, we conducted synteny analysis to infer that C. chenopodiifolia likely originated via allo-rather than auto-polyploidy and phased one of the four sub-genomes.Conclusions This study provides a draft genome assembly for C. chenopodiifolia, which is a polyploid, amphicarpic species within the Brassicaceae family. This genome offers a valuable resource to investigate the under-studied trait of amphicarpy and the origin of new traits by allopolyploidy.
  
  Reviewer 1. Rie Shimizu
  
  This manuscript deciphers the complicated genome of an octoploid species, Cardamine chenopodiifolia. They successfully assembled a chromosome-level genome with 32 chromosomes, consistent with the chromosome counting. They evaluated the quality of the genome by several methods (mapping Omni-C reads, BUSCO, variant calling etc.). All benchmarks ensured the high quality of their assembly. They even tried to phase the chromosomes into four subgenomes, and one subgenome was successfully phased thanks to its higher divergence compared to the other three sets. Despite their intensive effort, the other three subgenomes could not be phased, suggesting the relationship originated from the same or closely related species. As a whole, the manuscript is very well written and describes enough details, and the genome data looks like it is already available in a public database. They even added a description of the biological application of this assembly about the amphicarpy.
  
  I only found a few minor points for which I kindly suggest reconsideration/rephrasing before publication, as listed below. *As the review PDF does not contain the line numbers, I suggest the original description at the first line and then write my comments.
  
  –C. chenopodiifolia genome is octoploid …, suggesting that its genome is octoploid. They compare the 8C peak of C. hirsuta and 2C peak of the target, but considering the genome size variation among Cardamine species, I do not think this is an appropriate expression. The pattern may mean ‘consistent’ with the expectation from C. hirsuta peaks but does not ‘suggest’ octoploidy. -C. chenopodiifolia chromosome-level genome assembly PacBio Sequel II platform. Here and nowhere, they do not mention the mode of sequencing (only found in method and the title of a table). Maybe ‘HiFi’ could be added here to make the method clearer. -Table 2. It would make more sense to overview the genome quality if the N90 and L90 (or similar, if it is already fragmented at L90) values are added. (maybe the same for Table 1). Otherwise Nx curves would be also fine for the same purpose. -We obtained only 20800 variants,…as expected for a selfing species. It might be partially due to selfing in wild habitat, but also by selfing (5 times) in the lab. This should be mentioned here to avoid misleading. -Table 4 The unit of each item (bp, number, frequency…?) should be suggested. In addition to the points listed above, I appreciate more Information about the phased chromosomes set: Total subgenome sizes of this set and the other three sets?(1:3 or imbalanced?) It would be even better with a synteny plot in addition to the colinear plot as Fig 3C. (e.g. by GENESPACE or something similar, including phased and unphased chenopodiifolia chromosome sets and C. hirsuta)
  
  Reviewer 2. .Qing Liu
  
  This manuscript “Polyploid genome assembly of Cardamine chenopodiifolia” produced a chromosome-scale assembly of the octoploid C. chenopodiifolia genome using highfidelity long read sequencing with the Pacific Biosciences platform with two organelle genomes with a total length of 597.2 Mb and an N50 of 18.8 Mb together with BUSCO analysis (99.8% genome completeness), and phased one of the four sub-genomes. This study provides a valuable resource to investigate the understudied trait of amphicarpy and the origin of new traits by allopolyploidy. The manuscript is suitably edited and significant data for amphicarpy breeding of C. chenopodiifolia except for the below revision points. The major revision is suggested for the current version of the manuscript.
  
  1 Please elucidate “an N50 of 18.8 Mb”, which is Contig or Scaffold N50 length. 2 Please elucidate “originated via allo- rather than auto-polyploidy”, which is “originated via allopolyploidy rather than autopolyploidy”. 3 Please substitute the word “understudied trait” using alternative sensible word. 4 “to phase this set of chromosomes by gene tree topology analysis”, it is suggested to be “to phase this set of chromosomes by gene phylogeney analysis”. 5 In the first section of Resuts, Cardamine chenopodiifolia genome is octoploid is suggested. 6 Could Table 1 and Table2 be combined as one table to present the sequencing and assembly characterization of C. chenopodiifolia genome. 7 Could the entromere locations be predicted in Table 5, which is the 32 chromosome summary of C. chenopodiifolia genome. 8 In Table 2, assembly 32 chromosomes including two organelles, which is not close related with the C. chenopodiifolia genome, from my point of view, two organelle genome assembly do not critical section of manuscript. 9 Could all figure numbers are ordered below each group figures, for example the below figure should be numbered before the Figure 2A (according group figure presence order). I wonder it is Figure 2, authors want to elucidate the chromosome number 2n=42, while I can’t count out 42 chromosomes from present format.Could authors using alternative clear figure to show the cytological evidence of C. chenopodiifolia chromosome number. 10 In Figure 5A, it is difficult to point out the clear meaning for first-diverged chromosome from gene tree, which is a phylogenetic meaning tree or just framework, could author redraw this Figure 5A in order to reader got what you mean.
  
  Reviewer 3. Kang Zhang.
  
  The paper produced a chromosome-scale assembly of the C. chenopodiifolia genome in the Brassicaceae family, and offers a valuable resource to investigate the understudied trait of amphicarpy and the origin of new traits by allopolyploidy. I have the following comments which can be considered to improve the ms.
  
  Major points. 1.The introduction states that Cardamine is among the largest genera within the Brassicaceae family. The octaploid model species C. occulta and the diploid C. hirsuta have been sequenced. Therefore, I propose that a description of the evolutionary relationships among various species be included here. Additionally, the significance of the amphicarpic trait in the study of plant evolution and adaptation could be highlighted when discussing their octoploid characteristics. 2.The paper omits a detailed description of genome annotation and significant genomic features, which are essential for clearly illustrating the characteristics of the genome. To enhance this aspect, it would be beneficial to include a circular chart that displays fundamental components such as gene density, CG content, TE density, and collinearity links, among others. 3.The authors employed various techniques to differentiate the four subgenomic sets within the C. chenopodiifolia genome and ultimately managed to isolate a single sub-genomic set. The paper references the assembly of the octaploid genome of another model plant, C. occulta, within the same genus. Could it be utilized to compare with C. chenopodiifolia to achieve improvements? In addition, I suggest the authors to examine the gene density differences among these subgenomes, which could be helpful in distinguishing them. 4.Little important information were included in Table 1, 3, and Figure 4. These tables and figures should be moved to Supplementary data. 5.Evidence from Hi-C heatmap should be provided to validate the structural variations among different sets of subgenomes, such as those in Figure 3.
  
  Minor points. 1.Figure 5B, please change the vertical coordinate ‘# gene pairs’ to ‘Number of gene pairs’. The fonts in some figures are a little bit small. I suggest to adjust them to make it easy to read.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.01.24.576990v1
www.biorxiv.org www.biorxiv.org

The genome of the sapphire damselfish Chrysiptera cyanea: a new resource to support further investigation of the evolution of Pomacentrids

2
1. GigaScience 08 Jan 2025
  
  in GigaByte
  
  Editors Assessment:
  
  Among hot topics in coral reef research, the difference between anemonefish and other damselfish is currently a popular area of research. In this study the authors provide a new high-quality non-anemonefish genome, which will be of high relevance to further the depth of such analyses. In this case of the sapphire damselfish Chrysiptera cyanea, a widely distributed damselfish in the Indo-Pacific area, often studied to elucidate the roles of various environmental controls on their reproduction, and investigate related hormonal processes To further the potential of biomolecular analyses based on this species, this study generated the first genome of a Chrysiptera fish from a male individual collected in Okinawa, Japan. Using PacBio and HiFI long-read sequencing with 94.5x coverage, a chromosome-scale genome was assembled and 28,173 genes identified and annotated. Peer review gathered more parameters and details on the quality, and the final assembly comprised of 896 Mb pairs across 91 contigs, and a BUSCO completeness of 97.6%. This reference genome should therefore be of high value for future genetic-based approaches, from population structure to gene expression analyses.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 08 Jan 2025
  
  in GigaByte
  
  AbstractThe number of high-quality genomes is rapidly growing across taxa. However, it remains limited for coral reef fish of the Pomacentrid family, with most research focused on anemonefish. Here, we present the first assembly for a Pomacentrid of the genus Chrysiptera. Using PacBio long-read sequencing with a coverage of 94.5x, the genome of the Sapphire Devil, Chrysiptera cyanea was assembled and annotated. The final assembly consisted of 896 Mb pairs across 91 contigs, with a BUSCO completeness of 97.6%. 28,173 genes were identified. Comparative analyses with available chromosome-scale assemblies for related species identified contig-chromosome correspondences. This genome will be useful to use as a comparison to study the specific adaptations linked to symbiosis life of the closely related anemonefish. Furthermore, this species is present in most tropical coastal areas in the Indo-West Pacific and could become a model for environmental monitoring. This work will allow to expand coral reef research efforts and highlights the power of long-read assemblies to retrieve high quality genomes.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.144). These reviews are as follows.
  
  Reviewer 1. Darrin T. Schultz
  
  Are all data available and do they match the descriptions in the paper?
  
  No. The genome is also not yet on NCBI, but it would be good to upload it.
  
  Are the data and metadata consistent with relevant minimum information or reporting standards?
  
  Yes. I suggest later that there should be more information about the HiFi library preparation details, as the manuscript lacks them and it appears to be a non-standard (large insert size) library.
  
  Is the data acquisition clear, complete and methodologically sound?
  
  No. See above comment-
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  No. No parameters are provided for the genome assembly software, for read trimming, or for other software used.
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  No. See extended comments - the read data could use more QC, as well as the genome assembly.
  
  Is the validation suitable for this type of data?
  
  No.
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data?
  
  Yes. There is a degree of information missing about the data, but another researcher could use them for their study.
  
  Additional Comments:
  
  Thank you for the opportunity to review the work, The genome of the sapphire damselfish Chrysiptera cyanea: a new resource to support further investigation of the evolution of Pomacentrids, by Gairin and colleagues. In this manuscript, the authors collect an individual of the pomocentrid fish, Chrysiptera cyanea, in Okinawa, Japan. After isolating DNA, the sequencing center at OIST prepared and sequenced a SMRT sequencing library. Additionally, the authors generated some bulk RNA-seq data and sequenced it on the Illumina platform. The authors assembled the genome with two assemblers, and performed some comparisons of the C. cyanea contigs aligned to the chromosome-scale scaffolds of closely related pomacentrids. Given my background, I will mostly comment on the genomic analyses.
  
  I appreciate the authors' diligence in exploring different genome assembly methods and their efforts in running BUSCO and QUAST to QC the assemblies. The DNA sequencing data and assembly produced contigs that align well with the chromosomes of closely related species (which is convenient for comparative genomics!), and the manuscript presents a solid foundation for better understanding the chromosomal evolutionary history of the Pomacentridae.
  
  While this work represents an important step toward providing a new genomic resource for Chrysiptera cyanea, I see a few areas where the manuscript could be refined to enhance it as a community resource:
  
  (1) More information about data generation: Including additional details about the HiFi library preparation, specifically the chemistries used, the number of SMRT cells sequenced, and the bioinformatics steps used to generate the HiFi reads, would improve the manuscript's clarity and reproducibility. I have some questions regarding whether these libraries were prepared for HiFi sequencing: the reported mean read length of 25kbp is 10kbp longer than the standard HiFi library insert size; and the reported amount of bases in the reads, 84 Gbp, is more data than one would expect from a single CCS-processed SMRT cell, but could be the amount of data produced from one CLR run. Characterizing the quality score vs read length distribution could be helpful to characterize the read data. Clarifying these steps taken before the genome was assembled would strengthen the reliability of these reads as a resource.
  
  (2) Incorporating a few more important quality control (QC) steps would better clarify the completeness of the genome assembly. For instance, an estimate of genome size from the HiFi reads could be performed with jellyfish and GenomeScope, taking advantage of the k-mer fidelity of HiFi reads. This would provide a more conclusive estimate than the current comparison. Additionally, steps such as checking for contamination and providing an explanation for decisions like haplotig removal would make the assembly process more transparent. Lastly, supplementing the QC analysis with Merqury will provide a reliable answer to how complete the assembly represents the information in the individual HiFi reads in a way that complements BUSCO and QUAST.
  
  (3) The initial analyses of chromosome structure are a promising look into some yet-unexplored chromosomal changes in the Pomacentridae, and I think that incorporating a deeper phylogenetic analysis would build on this strength. Situating the chromosomal findings within a phylogenetic framework could provide stronger support, or actually resolve, the evolutionary interpretations presented. Doing this analysis likely could also help resolve whether the structures seen are genome misassemblies, or instead reflect lineage-specific chromosomal changes. The authors could supplement their beautiful figures using other tools that leverage whole-genome alignments and chromosome visualization to help answer these questions. One tool to try for two-genome comparisons, that the authors may have explored already in place of their ggplot script, is D-GENIES.
  
  Overall, this is a valuable resource, and I commend the authors for taking the steps to analyze the chromosomal evolutionary history within the pomacentrids. I look forward to seeing the authors’ future contributions to the field of genomics and chromosome evolution.
  
  Minor Points Line 125: Sharing the specific Trimmomatic settings used would enhance the reproducibility of the RNA-seq data processing. The parameters for genome assembly should also be added. Line 212: Are there any replicates for the RNA-seq data? Line 294: Consider uploading the assembly to NCBI for broader visibility and accessibility.
  
  Reviewer 2. Yue Song.
  
  Are all data available and do they match the descriptions in the paper?
  
  No. The authors have provided clues for accessing the data in public databases such as NCBI, but it seems that the data has not been released; At least, I haven't been able to obtain available data using the provided accession number (e.g. PRJNA1167451). I'm not sure if I've missed any information, but I believe it would be better if the data could be easily accessible to the public.
  
  Is the data acquisition clear, complete and methodologically sound?
  
  No. The authors used PacBio's third-generation sequencing technology for genome sequencing, which has become a "necessary option" for obtaining high-quality genomes in current genomic research. However, they did not further advance on the path of "assembling a chromosome-level genome" based on this version. Providing a chromosome-level genome would likely be more meaningful.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  No. Regarding the genome assembly and annotation process, the method described by the authors is overly simplistic and lacks detailed information on the parameters and procedures used. This makes it difficult for other researchers to effectively replicate the results described in the article.
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  No. The authors have calculated the N50 of contigs and the completeness of BUSCO genes, which are indeed two commonly used indicators for assessing the quality of genome assemblies. However, it is still challenging to gain a clear understanding of the assembly quality based solely on these two indicators. Could other measurements be added, such as comparing the continuity and completeness of the assembly with those of closely related species or other comparable species' genomes? Additionally, there is a point that is difficult to understand: the authors report a BUSCO completeness of approximately 94% for the genome, yet a BUSCO completeness of 97% for the gene set. It is puzzling how BUSCO genes that are not annotated in the genome can still be present in the gene set.
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data?
  
  No. As I mentioned earlier, the authors did not provide detailed information about the processing procedures and parameters, which makes it difficult for other researchers to replicate their results.
  
  Additional Comments: It is recommended that the authors provide a detailed description of the methods and easily accessible data retrieval methods. It would be even better if the authors could further provide a chromosome-level genome, as T2T (telomere-to-telomere) level genomes are becoming increasingly popular.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.11.06.622371v1
www.biorxiv.org www.biorxiv.org

PlasGO: enhancing GO-based function prediction for plasmid-encoded proteins based on genetic structure

4
1. GigaScience 05 Jan 2025
  
  in GigaScience
  
  Plasmid, as a mobile genetic element, plays a pivotal role in facilitating the transfer of traits, such as antimicrobial resistance, among the bacterial community. Annotating plasmid-encoded proteins with the widely used Gene Ontology (GO) vocabulary is a fundamental step in various tasks, including plasmid mobility classification. However, GO prediction for plasmid-encoded proteins faces two major challenges: the high diversity of functions and the limited availability of high-quality GO annotations. Thus, we introduce PlasGO, a tool that leverages a hierarchical architecture to predict GO terms for plasmid proteins. PlasGO utilizes a powerful protein language model to learn the local context within protein sentences and a BERT model to capture the global context within plasmid sentences. Additionally, PlasGO allows users to control the precision by incorporating a self-attention confidence weighting mechanism. We rigorously evaluated PlasGO and benchmarked it against six state-of-the-art tools in a series of experiments. The experimental results collectively demonstrate that PlasGO has achieved commendable performance. PlasGO significantly expanded the annotations of the plasmid-encoded protein database by assigning high-confidence GO terms to over 95% of previously unannotated proteins, showcasing impressive precision of 0.8229, 0.7941, and 0.8870 for the three GO categories, respectively, as measured on the novel protein test set.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: **David Burstein ** Version: Revision 1
  
  Review content: The authors thoroughly answered all my questions and addressed all the raised concerns. I have no further comments, and I congratulate them on a well executed study.
2. GigaScience 05 Jan 2025
  
  in GigaScience
  
  Plasmid, as a mobile genetic element, plays a pivotal role in facilitating the transfer of traits, such as antimicrobial resistance, among the bacterial community. Annotating plasmid-encoded proteins with the widely used Gene Ontology (GO) vocabulary is a fundamental step in various tasks, including plasmid mobility classification. However, GO prediction for plasmid-encoded proteins faces two major challenges: the high diversity of functions and the limited availability of high-quality GO annotations. Thus, we introduce PlasGO, a tool that leverages a hierarchical architecture to predict GO terms for plasmid proteins. PlasGO utilizes a powerful protein language model to learn the local context within protein sentences and a BERT model to capture the global context within plasmid sentences. Additionally, PlasGO allows users to control the precision by incorporating a self-attention confidence weighting mechanism. We rigorously evaluated PlasGO and benchmarked it against six state-of-the-art tools in a series of experiments. The experimental results collectively demonstrate that PlasGO has achieved commendable performance. PlasGO significantly expanded the annotations of the plasmid-encoded protein database by assigning high-confidence GO terms to over 95% of previously unannotated proteins, showcasing impressive precision of 0.8229, 0.7941, and 0.8870 for the three GO categories, respectively, as measured on the novel protein test set.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: **Nguyen Quoc Khanh Le ** Version: Revision 1
  
  Review content: No further comments to authors.
3. GigaScience 05 Jan 2025
  
  in GigaScience
  
  Plasmid, as a mobile genetic element, plays a pivotal role in facilitating the transfer of traits, such as antimicrobial resistance, among the bacterial community. Annotating plasmid-encoded proteins with the widely used Gene Ontology (GO) vocabulary is a fundamental step in various tasks, including plasmid mobility classification. However, GO prediction for plasmid-encoded proteins faces two major challenges: the high diversity of functions and the limited availability of high-quality GO annotations. Thus, we introduce PlasGO, a tool that leverages a hierarchical architecture to predict GO terms for plasmid proteins. PlasGO utilizes a powerful protein language model to learn the local context within protein sentences and a BERT model to capture the global context within plasmid sentences. Additionally, PlasGO allows users to control the precision by incorporating a self-attention confidence weighting mechanism. We rigorously evaluated PlasGO and benchmarked it against six state-of-the-art tools in a series of experiments. The experimental results collectively demonstrate that PlasGO has achieved commendable performance. PlasGO significantly expanded the annotations of the plasmid-encoded protein database by assigning high-confidence GO terms to over 95% of previously unannotated proteins, showcasing impressive precision of 0.8229, 0.7941, and 0.8870 for the three GO categories, respectively, as measured on the novel protein test set.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: **David Burstein **
  
  Review content:
  
  In this paper, the authors introduce "PlasGO," a language model for GO annotation of plasmid proteins. The model takes into account two levels of representation: (1) the amino acid level, producing embeddings of the analyzed proteins based on a foundation protein language model, and (2) the plasmid gene level, where the aa-based embeddings are considered as part of a language model representing each protein in the genetic context in which it is encoded. This approach leverages the modular organization of different functions on plasmid genomes. Benchmarking performed by the authors against other deep-learning GO annotation algorithms demonstrates a considerable improvement of PlasGO over existing methods. The research is timely, well-performed, and clearly explained. Main issues: 1. The authors acknowledge that only a relatively small portion of the proteins in their database have GO term annotations, which may limit the model's ability to learn plasmid patterns effectively. As they correctly point out, an iterative approach could be useful to improve performance. Specifically, high-confidence GO annotations predicted by PlasGO could be used as input for another round of prediction, and this process can be repeated until no new reliable predictions are produced. Given that the authors have all the data and models required to run such an iterative search, I would warmly recommend doing so and reporting if and how the predictions improve. 2. The gLM model (Hwang et al.) is highly similar to PlasGO in terms of the general approach, combining protein embedding (ESM2 in gLM) with genomic contextual data. Discussing the differences between the approaches and comparing their performances would provide important context and highlight the novelty of PlasGO. 3. The agreement of the PlasGO prediction with the GO terms retrieved from sequence databases ("ground truth") was determined by calculating the ratio of terms shared between the high-confidence predictions and ground truth, divided by the number of high-confidence predictions. This measure is asymmetrical and might generate over-optimistic results. At the extreme, if the algorithm produces a very large number of predictions, this value will tend to be very high just because there are many more GO terms predicted than GO terms in the ground truth. I strongly recommend using a symmetrical measure, such as the Jaccard index. 4. The methodology for calculating average precision and recall is potentially skewed. The authors compute average precision over proteins with at least one annotation, ignoring proteins lacking annotation (instead of counting these as misclassifications). This approach makes sense given that numerous plasmid proteins lack GO annotations. However, the average recall is calculated across all proteins (N). For unannotated proteins, the correct classification is not defined. Since these cases are also considered in the measure of recall, I assume PlasGO high-confidence predictions were considered correct. This seems like a problematic assumption that might lead to skewed results. I would therefore suggest that unannotated proteins be omitted from the recall calculation, as was done in the precision calculation. 5. The authors identify and filter out "elusive" GO terms that are difficult to predict. This is reasonable in the scope of this paper, but since it is still a central limitation of PlasGO, I would suggest discussing (even if not implementing) approaches to improve the predictions in these challenging cases. 6. In Figures 8 and 9, a perfect AUPR of 1 is reported in several cases. Such perfect classification performances are highly unusual and warrant an examination to double-check this result and if it persists discuss the underlying reasons for these perfect results. 7. The masking approach during training is not entirely clear. If I understand correctly, annotated proteins are being masked during prediction. This is expected to lead to the loss of a lot of contextual information. On the other hand, during training, the unannotated proteins are masked, losing potentially informative sequence data. I would suggest splitting complete plasmids between train/test/validation sets, and if needed, performing cross-validation to cover the entire dataset. This way for each plasmid the entire protein sequence and context information will be used. 8. There seems to be somewhat of a contradiction between the two following statements appearing in the paper: (1) "CaLM, despite being a pre-trained PLM, did not surpass the top three tools using ProtTrans, which is consistent with the results reported in CaLM's paper" and (2) "Experimental results demonstrate that the protein representations derived from CaLM outperform other PLMs in the classification of GO terms." Furthermore, other PLMs, such as ESM, performed better at GO annotation prediction according to the CaLM paper. These might have been more appropriate for this task. CodonBERT, a codon-based PLM also based on ProtTrans, could also have been a suitable alternative.
  
  Minor issues:- To improve the reading flow of the paper, consider reordering the ablation section to precede the "Performance on the RefSeq test set" section, since the ablation studies section provides the rationale for the choices of architecture and foundation protein language model.- "We initially downloaded all available plasmids from the NCBI RefSeq database" - I would suggest specifying the query or approach used to acquire all plasmids from RefSeq.- I would recommend using the term "protein embedding" instead of "protein token," which may be misleading. The term "token embeddings" used in Figure 3 is more accurate than "protein token," and yet "protein embeddings" is probably the most accurate term in this case.- Figure 1: To provide an accurate depiction of representative plasmids, I suggest including unannotated genes in Figure 1.- Figure 4: "Global average pooling" was misspelled.- Figure 10: "The prediction precision of PlasGO is determined by calculating the ratio of the number of proteins in set A that are also present in set B to the total number of predicted high-confidence proteins (|A|)". If I understand the figure correctly, it should be "number of GO terms" instead of "number of proteins" in both cases.- A figure (or supplementary figure) depicting one of the plasmids with some of the high-confidence predictions in the case study section (along the same lines as Figure 1 but with a distinction between previously known and unknown annotations) could enhance the clarity and impact of the results.
4. GigaScience 05 Jan 2025
  
  in GigaScience
  
  Plasmid, as a mobile genetic element, plays a pivotal role in facilitating the transfer of traits, such as antimicrobial resistance, among the bacterial community. Annotating plasmid-encoded proteins with the widely used Gene Ontology (GO) vocabulary is a fundamental step in various tasks, including plasmid mobility classification. However, GO prediction for plasmid-encoded proteins faces two major challenges: the high diversity of functions and the limited availability of high-quality GO annotations. Thus, we introduce PlasGO, a tool that leverages a hierarchical architecture to predict GO terms for plasmid proteins. PlasGO utilizes a powerful protein language model to learn the local context within protein sentences and a BERT model to capture the global context within plasmid sentences. Additionally, PlasGO allows users to control the precision by incorporating a self-attention confidence weighting mechanism. We rigorously evaluated PlasGO and benchmarked it against six state-of-the-art tools in a series of experiments. The experimental results collectively demonstrate that PlasGO has achieved commendable performance. PlasGO significantly expanded the annotations of the plasmid-encoded protein database by assigning high-confidence GO terms to over 95% of previously unannotated proteins, showcasing impressive precision of 0.8229, 0.7941, and 0.8870 for the three GO categories, respectively, as measured on the novel protein test set.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: **Nguyen Quoc Khanh Le **
  
  Review content: 1. The manuscript introduces PlasGO, which leverages a hierarchical architecture for GO term prediction in plasmid-encoded proteins. However, the novelty of the approach could be questioned. While the combination of protein language models and BERT for GO prediction is innovative, similar methods have been applied in other contexts. 2. The study heavily relies on data from the RefSeq database, yet there is limited discussion on the quality and completeness of this data. The manuscript should address potential biases introduced by incomplete or incorrect GO annotations in the database. Moreover, the study uses protein sequences of up to 1K length, which might exclude relevant longer sequences, potentially limiting the model's applicability to all plasmid-encoded proteins. 3. The manuscript claims that PlasGO can generalize well to novel proteins, but this claim is based on a specific dataset. The model's generalizability should be tested on more diverse and independent datasets, including plasmids from different bacterial species or environmental contexts. 4. While the model's performance is quantitatively evaluated, the interpretability of the results remains unclear. The study should include an analysis of how well the model's predictions align with known biological functions and pathways. Additionally, it would be helpful to include examples where PlasGO provides novel insights that other models do not, thereby demonstrating its practical utility. 5. The manuscript does not provide detailed information on the computational resources required to train and run PlasGO. Given the complexity of the model, there are potential concerns about its scalability, particularly for larger plasmid datasets or in settings with limited computational power. 6. The manuscript compares PlasGO with several state-ofthe-art tools, but the comparison might not be fully exhaustive. Additionally, statistical significance tests for performance differences should be provided to support the comparative analysis. 7. Language models have been used in previous bioinformatics studies i.e., PMID: 37381841, PMID: 38636332. Therefore, the authors are suggested to refer to more works in this description to attract a broader readership. 8. The study should discuss any ethical considerations related to the use of public datasets, particularly regarding data privacy and consent if any sensitive data is involved. Furthermore, the potential commercial implications of the PlasGO tool, especially if it is used for proprietary research, should be addressed. 9. While the manuscript mentions that PlasGO's code will be made available, it is crucial to ensure that all aspects of the research are fully reproducible. 10. The hierarchical architecture and the use of extensive training data might lead to overfitting, especially given the high dimensionality of the input features. The manuscript should discuss the measures taken to prevent overfitting, such as regularization techniques, dropout, or cross-validation strategies. 11. The study could benefit from a more detailed discussion on the practical implications of using PlasGO in real-world plasmid research. How can this tool be integrated into existing workflows for plasmid function prediction? What are the potential limitations in practical applications?
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.07.03.602011v1
Dec 2024
www.biorxiv.org www.biorxiv.org

NeuroVar: An Open-source Tool for Gene Expression and Variation Data Visualization for Biomarkers of Neurological Diseases

2
1. GigaScience 26 Dec 2024
  
  in GigaByte
  
  Editors Assessment:
  
  Coded and written up as part of the African Society for Bioinformatics and Computational Biology (ASBCB) Omicscodeathons, NeuroVar is a new tool for visualizing genetic variation (Single nucleotide polymorphisms and insertions/deletions) and gene expression data related to neurological diseases. The open source R-tool is available as an online Shiny Application and a desktop application that does not require any computational skills to use. Initial validation and case studies for the tool present analyses of biomarkers in ALS, exemplifying its relevance in personalized medicine and genomic discovery. Being an Open Source project, after peer review more detail has been added in paper and GitHub repo on how to contribute, report issues or seek support. Alongside some improved installation guidelines. The paper states future developments will expand its dataset beyond the ClinGen database to encompass new databases and broader genetic inquiries.
  
  *This evaluation refers to version 1 of the preprint *
2. GigaScience 26 Dec 2024
  
  in GigaByte
  
  AbstractBackground The expanding availability of large-scale genomic data and the growing interest in uncovering gene-disease associations call for efficient tools to visualize and evaluate gene expression and genetic variation data.Methodology Data collection involved filtering biomarkers related to multiple neurological diseases from the ClinGen database. We developed a comprehensive pipeline that was implemented as an interactive Shiny application and a standalone desktop application.Results NeuroVar is a tool for visualizing genetic variation (single nucleotide polymorphisms and insertions/deletions) and gene expression profiles of biomarkers of neurological diseases.Conclusion The tool provides a user-friendly graphical user interface to visualize genomic data and is freely accessible on the project’s GitHub repository (https://github.com/omicscodeathon/neurovar).
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.143). These reviews are as follows.
  
  **Reviewer 1. Joost Wagenaar **
  
  Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?
  
  Yes. There is a clear statement of need, but the audience is not very targeted. The investigators outline the need for tools to help users identify phenotypic subtypes of disease and describe how the tool would help with this. Although the investigators mention that the tool will allow users to analyze biomarker data, the scope of the types of analysis that can be performed is relatively small. I think that it would benefit the tool to better define the targeted users (clinicians, data scientists, enthusiasts?) and develop specifically towards a single audience.
  
  The tool leverages several existing R packages to run the analysis over the data and the provided tool can be described as a user-friendly wrapper around these libraries. The interface allows users to submit a file, and plot the results of the analysis within the app.
  
  As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?
  
  No. I did not see any guidelines for contributing to the project in the paper, or in the associated GitHub repository.
  
  Is the documentation provided clear and user friendly?
  
  Yes, the investigators did a great job providing documentation and installation instructions. [also video demo: https://youtu.be/cYZ8WOvabJs?si=DnxVuL65yr0wYYjq]
  
  Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?
  
  Yes, the investigators provide a clearly-stated list of dependencies and instructions on how to install them prior to running the application. Is test data available, either included with the submission or openly available via cited third party sources (e.g. accession numbers, data DOIs)?
  
  Yes. The paper, and GitHub repository point to a public dataset that can be used to test the application.
  
  Are there (ideally real world) examples demonstrating use of the software?
  
  Yes. The investigators provide a video highlighting the use of the application and provide a use-case where they use the app to validate some existing knowledge.
  
  Is automated testing used or are there manual steps described so that the functionality of the software can be verified?
  
  No. The application is sufficiently small that no automated testing or manual testing would necessary be required beyond validating that the application works.
  
  Additional Comments:
  
  The proposed application provides a nice tool that makes visualization of vcf data and analysis easier for users who are not comfortable working within R directly. It provides a nice demonstration how the scientific community can wrap scientific tools into deployable applications and tools that can be easily understood. A question remains on the target audience for an application like this as most people who are interested in these type of analysis and visualizations are, in fact, familiar enough with R, or other programming languages to directly leverage the libraries and plot the results.
  
  That said, as data integration and multi-omics visualization becomes more complex and the app provides more ways to visualize the data in meaningful ways, I do strongly believe that applications like this can provide a meaningful addition to the scientific tools that are available.
  
  Reviewer 2. Ruslan Rust
  
  Is the language of sufficient quality? Yes. The language quality of the document is of sufficient quality. I did not notice any major issues.
  
  Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?
  
  Yes. Yes, authors provide a statement of need. Authors mention that there is the need for a specialized software tool to identify genes from transcriptomic data and genetic variations such as SNPs, specifically for neurological diseases. Perhaps authors could expand on how they chose the diseases. E.g. stroke is not listed among the neurological diseases. Perhaps authors could expand a bit on the diseases they chose in the introduction.
  
  Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code?
  
  Yes the source code is available in github under the following link: https://github.com/omicscodeathon/neurovar. Additionally authors deposited the source code and additional supplementary data in a permanent depository with zenodo under the following DOI: https://zenodo.org/records/13375493. They also provided test data https://zenodo.org/records/13375591. I was able to download and access the complete set of data
  
  As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?
  
  No. I did not find any way to contribute, report issues or seek support. I would recommend that the authors add this information to the Github README file.
  
  Is the code executable?
  
  Yes, I could execute the code using Rstudio 4.3.3
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?
  
  Yes. I could follow the installation process, but perhaps authors could add few more details how to download from Github in more detail. As some scientist may have trouble with it. Also perhaps an installation video (additionally to the video demonstration of the Neurovar Shiny App might be helpful.
  
  Is the documentation provided clear and user friendly?
  
  Yes. The documentation is provided and is user friendly. I was able to install, test and run the tool using RStudio. Authors may consider to offer also a simple website link for the RshinyTools if possible. This may enable the access also for scientists that are not familiar with R.Especially, it is great that authors provided a demonstration video. I was able to reproduce the steps. However, I would recommend to add more information into the Youtube video. E.g. reference to the preprint/ paper and Github link would be helpful to connect the data.Perhaps authors could also expand a bit on the possibilities to export data from their software. And provide different formats e.g., PDF / PNG /JPEG. I think this is important for many researchs to export their outputs e.g., from the heatmaps.
  
  Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?
  
  Yes, dependencies are listed and are installed automatically. It worked for me with Rstudio version 4.3.3. In the manuscript and in the repository.
  
  Is test data available, either included with the submission or openly available via cited third party sources (e.g. accession numbers, data DOIs)?
  
  Yes the authors provide test data with this doi: https://doi.org/10.5281/zenodo.13375590
  
  Are there (ideally real world) examples demonstrating use of the software?
  
  Yes, authors use the example of Epilepsy, focal epilepsy and the gene of interest DEPDC5. I replicated their search and got the same results. However, I find that the label in Figure 1 in the gene’s transcript could be a bit more clear. E.g. it is not clear to me what transcript start and end refers to. It might also be more helpful if authors provide an example dataset for the Expression data that is loaded in the software by default.Furthermore authors use a case study results using RNAseq in ALS patients with mutations in FUS, TARDBP, SOD1, VCP genes.
  
  Is automated testing used or are there manual steps described so that the functionality of the software can be verified?
  
  No. Automated testing is not used as far as I can access it.
  
  Additional Comments: The preprint version of this paper was also reviewed in ResearchHub: https://www.researchhub.com/paper/7381836/neurovar-an-open-source-tool-for-gene-expression-and-variation-data-visualization-for-biomarkers-of-neurological-diseases/reviews
  
  My expertise: I am assistant professor in neuroscience and physiology at University of Southern California and work on stem cell therapies on stroke. We are particularly interested in working with genomic data and the development of new biomarkers for stroke, AD and other neurological diseases.
  
  Summary: The authors provide a software tool NeuroVar that helps visualizing genetic variations and gene expression profiles of biomarkers in different neurological diseases.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.08.21.609056v1
www.biorxiv.org www.biorxiv.org

demuxSNP: supervised demultiplexing scRNAseq using cell hashing and SNPs

2
1. GigaScience 05 Dec 2024
  
  in GigaScience
  
  algorithm are used to train a KNN classifier that predicts the demultiplexing classes of unassigned or uncertain cells. We benchmark demuxSNP against hashing (HTODemux, cellhashR, GMM-demux, demuxmix) and genotype-free SNP (souporcell) methods on simulated and real data from renal cell cancer. Our results demonstrate that demuxSNP outperformed standalone hashing methods on low quality hashing data, improving overall classification accuracy and allowing more high RNA quality cells to be recovered. Through varying simulated doublet rates, we show genotype-free SNP methods are unable to identify biological samples with low cell counts at high doublet rates. When compared to unsupervised SNP demultiplexing methods, demuxSNP’s supervised approach was more robust to doublet rate in experiments with class size imbalance.Conclusions demuxSNP is a performant demultiplexing approach that uses hashing and SNP data to demultiplex datasets with low hashing quality where biological samples are genetically distinct. Unassigned cells (negatives) with high RNA quality can be recovered, making more cells available for analysis, especially when applied to data with low hashing quality or suspected misassigned cells. Pipelines for simulated data and processed benchmarking data for 5-50% doublets are publicly available. demuxSNP is available as an R/Bioconductor package (https://doi.org/doi:10.18129/B9.bioc.demuxSNP).
  
  Reviewer 2: Haynes Heaton Reviewer Comments: demuxSNP is a tool for combining the demultiplexing capabilities of hashtagging and SNP based genotype demultiplexing of scRNAseq with cells from individuals mixed for cost savings and batch effect reduction. The authors test this method in comparison with other methods for either hashtag demultiplexing or genotype based demultiplexing individually and show improvements in recovering cells not confidently assigned via hashtagging as well as overcoming cases where genotype demultiplexing fails.comments on results:For figure 2 this is mostly this is good for recovering low hash quality cells. Although because the low quality hashing has been simulated in order to have a ground truth to compare to, it is unclear if this simulation method or amount realistic? Does it compare to % unassigned from real datasets?For figure 3 my main issue is why would souporcell out perform demuxSNP at any % doublets? Souporcell is using strictly less information than demuxSNP because it does not assume hashtags. Ideally this would be fixed or at the very lease an adequate explanation given.Comments on methods:"SNPs are filtered to those located within genes expressed across most cells in thedataset" and "SNPs with few reads across cells in the dataset are removed." -- Can I get numbers on this? If you require say 50% of cells to express a SNP locus, it will throw out a huge amount of the still informative SNPs. I find that utilizing as much of the data as possible is generally better. I assume this is done because of the KNN method which will require high overlap in SNPs between cells being compared."Labels from high confidence singlets along with simulated doublets used to train KNNclassifier and predict negative/uncertain cells." Why a KNN model here. Genotype data is not euclidean. Each SNP locus for each cell should be drawn as a binomial with underlying p=0+some error (homozygous ref) p=0.5+/- some error (heterozygous), or p=1.0-some errorA statistical model would be more appropriate for this."To leverage classification techniques applicable to binary data, SNP status is recoded toabsent/present (1,0) and k-nearest-neighbour classification (KNN) [31] is performed usingJaccard coefficient." Ah, so you force the data to be euclidean, but this does not take full advantage of the data. One problem with this will be when two individuals are related. For SNP loci of a parent/child there are many cases where this potentially could have disambiguated them but wont because one individual is heterozygous (so snp present) and the other is homozygous alt (still snp present).General comments: these are small nitpicksThe primary failure modes of genotype demultiplexing in no particular order are 1. small number of cells in a minority cluster 2. large number of individuals multiplexed together. and 3. large number of doublets causing lots of noise in the statistical models. The authors have adequately addressed improvements in 1 and 3. However, I think the paper would be stronger if it also did experiments with >30 individuals multiplexed together. For 3, I think further discussion is merited on the tradeoffs of hyperloading scRNAseq protocols including the # of quality singletons vs loading rate and multiplet rate and how many multiplets escape detection. Experiment designers want to maximize the number of singletons while minimizing the number of doublets that escape detection and harm downstream analyses. 10x genomics gives the ballpark doublet % to be expected as 1% per 1000 cells recovered. But this is a poisson loading process, so the true effect is not linear. The authors test up to 50% doublets (which is good to test), and some experimenters do attempt to load enough to recover 50k cells from a single lane, but I doubt that would be a recommended loading level for downstream analysis unless the doublet detection is highly effective.
2. GigaScience 05 Dec 2024
  
  in GigaScience
  
  Background Multiplexing single-cell RNA sequencing experiments reduces sequencing cost and facilitates larger scale studies. However, factors such as cell hashing quality and class size imbalance impact demultiplexing algorithm performance, reducing cost effectivenessFindings We propose a supervised algorithm, demuxSNP, leveraging both cell hashing and genetic variation between individuals (SNPs). The supervised algorithm addresses fundamental limitations in demultiplexing with only one data modality. The genetic variants (SNPs) of the subset of cells assigned with high confidence using a probabilistic hashing
  
  Reviewer 1: Lei Li Reviewer Comments: Lynch et. al developed demuxSNP, a supervised demultiplexing approach for single-cell cell hashing data in a multi-modal (hashtag expression and SNP profiles) fashion. They utilized a probabilistic method to infer sample identities of cells using cell hashing modality, and then build a KNN model using SNPs of high cofinance ones from previous step. They then use this KNN model to predict cell identities for cells assigned as uncertain or negative by cell hashing.They have demonstrated the performance through a comparison with existing single-modal methods using both real data and simulated data. They have published an R package for the research community. It is interesting and encouraging to see another study focuses on multi-modal demultiplexing for cell hashing data. Below are some major and minor points from my side:1. I am not surprised that a multi-modal demultiplexing beats single-modal methods across both real and simulated datasets. To my knowledge, there are at least two groups proposed multi-modal demultiplexing approach for cell hashing data. Both were uploaded to bioRxiv last year and get published recently. One called hadge (https://link.springer.com/article/10.1186/s13059-024-03249-z ), and another called HTOreader hybrid (https://academic.oup.com/bib/article/25/4/bbae254/7686601), which is discussed by this study. Hadge is a comprehensive framework that integrated popular cell hashing-based and SNP-based methods, allowing for a joint deconvolution by combining best method from each modality. HTOreader hybrid proposed an improved demultiplexing method for cell hashing signals, and then also integrates demultiplexing results from both modality for a better deconvolution in a hybrid fashion. Indeed, this work has implemented different method for the same purpose. I tried both methods, and there're some major updates between bioRxiv version and published version. Thus, even one of them has been discussed, I think it's still necessary to include these two published methods into comparison, to reveal pros and cons of different methods, therefore provide useful information for users to select the method according to their specific experiment configurations.2. demuxSNP method picked top N commonly expressed genes for SNP calculation. In the tutorial on Github, the N was set to 100. I am wondering in a more heterozygous dataset, the N = 100 still sufficient or not. Is there a way for users to determine the N for their specific dataset more systematically? Or the authors can show some data to demonstrate that N = 100 is robust across different datasets?3. The dataset GSE267835 is private. Please provide reviewer token in the Data Availability statement during submission process.4. Color of uncertain cells in Fig1-B is a bit misleading cause in Fig1-A the same color was used to represent "background staining". Even A and B and different panels, however, a big black arrow makes readers thought they're the same data. Therefore, change the color of uncertain cells into another color would be good to avoid confusions.5. In Fig2-A and B, what are the units for the X axis? Are they log2 or log2 hashtag counts? Please add that information to the figure and legend.6. For Fig-2 C and D, please use the formal spell of names of existing methods like you did in Fig2E.7. Please add line numbers to the draft for reviewers' convenience8. Some minor format issues exist. For example, the "Result" section should a header format instead of normal text.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.04.22.590526v1
www.biorxiv.org www.biorxiv.org

stMMR: accurate and robust spatial domain identification from spatially resolved transcriptomics with multi-modal feature representation

2
1. GigaScience 05 Dec 2024
  
  in GigaScience
  
  performance of stMMR in multiple analyses, including spatial domain identification, pseudo-spatiotemporal analysis, and domain-specific gene discovery. In chicken heart development, stMMR reconstruct the spatiotemporal lineage structures indicating accurate developmental sequence. In breast cancer and lung cancer, stMMR clearly delineated the tumor microenvironment and identified marker genes associated with diagnosis and prognosis. Overall, stMMR is capable of effectively utilizing the multi-modal information of various SRT data to explore and characterize tissue architectures of homeostasis, development and tumor.
  
  Reviewer 2: Hongzhi Wen Reviewer Comments: The paper introduces stMMR, a multi-modal graph learning method designed to integrate gene expression, spatial location, and histological information for accurate spatial domain identification from spatially resolved transcriptomics (SRT) data. The method employs graph convolutional networks (GCN) and self-attention modules, along with cross-modal contrastive learning, to enhance feature integration and representation.Strengths:1. Using GCN to capture local spatial dependency is natural and effective. Introducing attention mechanism for capturing global relations intuitively make senses, however, need more justification. Contrastive learning for cross-modal feature fusion is also a natural choice in multimodal learning. Overall, the methodology is novel and solid.2. Extensive benchmark analysis across various types of spatial data and tissues demonstrates superior performance of the method in spatial domain identification, pseudo-spatiotemporal analysis, and domain-specific gene discovery. The empirical evidence is very convincing.3. The method's application to chicken heart development, breast cancer, and lung cancer showcases its potential in reconstructing spatiotemporal lineage structures and delineating tumor microenvironments, highlighting its value in clinical research.Weaknesses:1. In Figure 4, SpaceFlow is the only baseline for the case study. However, the performance of SpaceFlow is not topranked in other experiments. There should be a justification for why SpaceFlow is highlighted here.2. The contribution of the global attention mechanism to the whole framework is not very clear. The authors may provide more intuition and empirical justification (e.g., ablation study) if they would like to highlight this design.3. By introducing the hyperparameters $\alpha$, $\beta$ and $\gamma$ in Eq. (11), the method has a significantly larger search space than other methods. It is important to note how these hyperparameters are chosen in practice, more importantly, whether the test performance is referred when adjusting these hyperparameters. This might result in an unfair evaluation.
2. GigaScience 05 Dec 2024
  
  in GigaScience
  
  AbstractDeciphering spatial domains using spatially resolved transcriptomics (SRT) is of great value for the characterizing and understanding of tissue architecture. However, the inherent heterogeneity and varying spatial resolutions present challenges in the joint analysis of multi-modal SRT data. We introduce a multi-modal geometric deep learning method, named stMMR, to effectively integrate gene expression, spatial location and histological information for accurate identifying spatial domains from SRT data. stMMR uses graph convolutional networks (GCN) and self-attention module for deep embedding of features within unimodal and incorporates similarity contrastive learning for integrating features across modalities. Comprehensive benchmark analysis on various types of spatial data shows superior
  
  Reviewer 1: Shihua Zhang Reviewer Comments: In this paper, the authors developed a multi-modal deep learning method for identifying spatial domains from ST data by integrating gene expression, spatial location and histological information. This method adopts the graphconvolutional networks and self-attention module for deep embedding of features within unimodal and incorporates similarity contrastive learning for integrating features across modalities. They did several typical analysis to valid this this method. Generally, the wiring of this paper is OK. More specific comments:1. Spatial domain has been overwhelmingly studied recently. The authors need to pay more attention to why it is needed to introduce a new method. The novelty of the current method should be carefully clarified. For example, how the histological information help to improve the performance? Does the "geometric" deep learning really help?2. This method has been applied to some stereotypical data. The authors should applied it to some recently generated data by some new ST techniques.3. Figure 3 stMMR enhances spatial gene expression profiles. It is hard to see how the method enhance the spatial gene expression (e.g., LPL).4. With the accumulation of multi-slice spatial transcriptome data, the integration and alignment of spatial transcriptome data will be essential. Can this method be extended for this situation like STAGATE (Nat Comput Sci.2023 Oct; 3(10):894-906)? This will be valuable for ST analysis.5. The scalability of this method should be carefully explored.6. The authors should provide a detailed tutorial for users.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.02.22.581503v2
www.biorxiv.org www.biorxiv.org

De novo assembly and characterization of a highly degenerated ZW sex chromosome in the fish Megaleporinus macrocephalus

2
1. GigaScience 05 Dec 2024
  
  in GigaScience
  
  Conclusions The chromosome-level genome of piauçu exhibits high quality, establishing a valuable resource for advancing research within the group. Our discoveries offer insights into the evolutionary dynamics of Z and W sex chromosomes in fish, emphasizing ongoing degenerative processes and indicating complex interactions between Z and W sequences in specific genomic regions. Notably, amhr2 and bmp7 are potential candidate genes for sex determination in M. macrocephalus.
  
  Reviewer 2: Changwei Shao Reviewer Comments: The authors reported the M. macrocephalus reference genome with a highly degenerated ZW sex chromosome and analyzed the expression pattern of sex chromosomes. In a word, this work extends our understanding of the mechanisms of sex chromosome evolution of fish species. The interpretation of the results is sound for the most part, and gives enough proof to verify their results. I just have few concerns as followed.1.On line 54, please confirm it. In the tongue-sole, the size of Z chromosome (21.91Mb) is larger than the W chromosome(16.45Mb).2.On line 88, 89 and 116, the numbers mentioned do not correspond with the results in Figure 1A. Please confirm it.3.In the section on "Gene Prediction and Annotation", a more comprehensive prediction of gene structure can be achieved by combining three methods: de novo prediction, transcriptome prediction, and homology prediction. The results obtained from these three approaches can be integrated using the EVM software, followed by annotation assessment with BUSCO. The method section is somewhat vague and lacks clear logic. For protein prediction, it is advisable to utilize multiple databases, such as SwissProt, InterPro, and Nr, to corroborate evidence from various sources.4.On line 210, there is an error in the caption of Figure 3. Figure 3B should be a colinearity map of the linkage groups and chromosomes.5.The SNP sites identified in females may include those from the Z chromosome, linkage group 23 (LG23) will contain SNP information from both the Z and W chromosomes. This could potentially affect the demarcation of the region of sex conflict.6.On the sex chromosomes, are there candidate genes related to sex differentiation in regions with a high enrichment of specific SNPs? please provide a detailed explanation.7.What is the distribution of genes in the Z and W chromosome-specific regions, and what is the gene loss rate?
2. GigaScience 05 Dec 2024
  
  in GigaScience
  
  AbstractBackground Megaleporinus macrocephalus (piauçu) is a Neotropical fish within Characoidei that presents a well-established heteromorphic ZZ/ZW sex-determination system and thus, constitutes a good model for studying W and Z chromosomes in fishes. We used PacBio reads and Hi-C to assemble a chromosome-level reference genome for M. macrocephalus. We generated family segregation information to construct a genetic map, pool-seq of males and females to characterize its sex system, and RNA-seq to highlight candidate genes of M. macrocephalus sex determination.Results M. macrocephalus reference genome is 1,282,030,339 bp in length and has a contig and scaffold N50 of 5.0 Mb and 45.03 Mb, respectively. Based on patterns of recombination suppression, coverage, Fst, and sex-specific SNPs, three major regions were distinguished in the sex chromosome: W-specific (highly differentiated), Z-specific (in degeneration), and PAR. The sex chromosome gene repertoire was composed of genes from the TGF-β family (amhr2, bmp7) and Wnt/β-catenin pathway (wnt4, wnt7a), and some of them were differentially expressed.
  
  Reviewer1: Yusuke Takehana Reviewer Comments: The authors assembled a chromosome-level genomic sequence and identified the sex chromosomes of the fish Megaleporinus macrocephalus. This manuscript is potentially interesting because evolution of sex chromosomes and sex-determining genes are one of the most fundamental and popular topics in the evolutionary biology. However, the conceptual advance and the novelty of this study are quite limited. It is another paper adding now one more species to the list of assembled genomes in this fish family. In addition, there is nothing new about the description of the sex chromosomes such as their degenerative signature. Such studies have already been conducted many times and similar conclusions have been reported. Furthermore, the experimental evidence presented appears rather preliminary and is not sufficient to support the claims and interpretations presented in discussion. I am therefore afraid that I have to say that the manuscript does not provide new insights into evolution of sex chromosomes, and thus will not be of sufficient interest to the readers of Gigascience.1. Overall, the paper was very difficult to read due to a lack of logic structure and many errors, such as confusing between males and females, between chromosomes and linkage groups, and so on.2. The introduction is not logically written. It is unclear what is known and to what extent, and why the genome of this species is being determined.3. I did not understand why the authors concluded that Chr13 is the W chromosome and not the Z chromosome. They should assemble the Z and W chromosomes separately and confirm them from different information. It is also unclear how they rule out the possibility that the sequences are chimeric. If they really want to reveal the evolutionary process of sex chromosomes, they should use all the data (Hi-C, linkage analysis, Pool-seq, gene information) to compare the structure of Z and W in detail, including synteny with closely related species.4. The analysis on sex chromosome gene candidates is too poor. Basic analyses have not been conducted on whether these genes are W-specific, whether they are in both Z and W, whether they have paralogs or not on autosomes, how much sequence variation there is, when and in which cells they are expressed, etc.5. All of the discussions are superficial and lacking in logic, and it is unclear what they want to discuss.6. The figures legends are poorly explained, and contain incorrect information, so I don't understand the meaning of the data at all.7. This manuscript contained many grammatical errors leading to many confusing statements, and some sentences that were grammatically correct but awkward meaning. I strongly recommend that the authors seek advice of someone with a good knowledge of English, preferably a native speaker.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.03.07.583869v1
www.biorxiv.org www.biorxiv.org

CAT Bridge: An Efficient Toolkit for Gene-Metabolite Association Mining from Multi-Omics Data

2
1. GigaScience 05 Dec 2024
  
  in GigaScience
  
  Conclusions We applied CAT Bridge to experimentally obtained Capsicum chinense (chili pepper) and public human and Escherichia coli (E. coli) time-series transcriptome and metabolome datasets. CAT Bridge successfully identified genes involved in the biosynthesis of capsaicin in C. chinense. Furthermore, case study results showed that the convergent cross mapping (CCM) method outperforms traditional approaches in longitudinal multi-omics analyses. CAT Bridge simplifies access to various established methods for longitudinal multi-omics analysis, and enables researchers to swiftly identify associated gene-metabolite pairs for further validation.
  
  Reviewer2: JITENDRA KUMAR Barupal Reviewer Comments: To the authors,Thank you for the opportunity to review the manuscript GIGA-D-24-00083. The authors created a tool to predict association between genes and metabolites using various algorithms. The authors provide the tool as a web application, and as a python package. To get the reciprocal relationship between gene and metabolites, i.e. which metabolites can change which gene or vice versa, this tool can be a toolkit for the biologist or bioinformatician.The tool has application specially the relationship between changes in genes and metabolites is not direct, many complex mechanisms exist e.g. epigenetic or polymorphism. So the tool can be alternate to other available tools.Also, the manuscript brings the community focus on causal relationships instead of just correlation based relationships. The tool used temporal causality algorithms for predicting relationships between genes and metabolites.However, I recommend major revisions before publication. Here are my reasons and comments for the revisions:General issues with web accessibility and package installation :1. There are concerns about web accessibility, as indicated by web browsers flagging the connection as insecure. This may stem from geographical restrictions or the absence of HTTPS certification. Addressing these issues would ensure secure access to the server.2. Despite successful initiation of the client application from the git repository as a python module, no results were generated upon launching. It is suggested that the authors distribute the tool as a Docker image to facilitate seamless usage, eliminating concerns regarding dependencies and version compatibility.Other comments :1. There are inconsistencies regarding data preprocessing. While the manuscript mentions that the tool will handle preprocessing, it also indicates that users need to provide processed files. Clarification is needed on whether preprocessing is required. It seems, the tool required preprocessed data.2. For clarity use "causality and correlation" instead of "causality/correlation" algorithms.3.Can the tool process any new temporal numerical data series, or does it specifically filter for genes? For instance, if I provide a list of proteins along with a list of genes, will I receive the association between them? It is suggested to include this in the discussion section.4.Does the tool offer the capability to generate a causal diagram or network from these vectors, thereby providing visual support for their assertion regarding the causal relationship between metabolites and genes? If the author is working in this direction, it is suggested that information can be added in the discussion section.5. What definition of causal relationship did the author use, and could they provide a citation for their definition. Predictability or any other criteria were used for causal relationships. Please include the definition or criteria in the introduction and method section.6. What are the minimum or maximum time points (interval) for input files? e.g. will the tool work if I provide only two times points or If I provide 48 times points. Please include the information in the method section.7. What is the influence of the number of time points on the vector relationship presented in the paper? Have any studies by the authors addressed this question? Please include the results and discussion.8. Could the authors clarify which heuristic algorithm was employed for ranking the genes? Additionally, can they elaborate on how their approach to gene ranking is heuristic rather than relying on mathematical optimization or algorithmic methods? Clarification on the term "heuristic" would be beneficial.9. Could the authors offer an example from studies conducted on yeast, E. coli, or other simple organisms, demonstrating how changes in gene sequences have readily been observed to affect metabolite levels? Please include that in the results section.10. Does the tool generate a vector indicating many-to-many relationships or one-to-one relationships? In other words, does it reveal whether one gene is associated with many metabolites, and vice versa, or if it establishes a single genemetabolite relationship? Please include this in the results section. Also, in the discussion section please include examples of application of these relationships in various fields e.g. metabolic engineering or cancer metabolism.11. Table 1 compares the features of CAT Bridge with other available methods. It should encompass features provided by other tools that are not available in the author's tool, such as knowledge-driven integration or integration with a third-party database. Additionally, it should address the limitation posed by the requirement of time series data, which is not just a strength but also a challenge, particularly for epidemiology studies where multiple time series for gene expression may not be feasible.12. Please use alternative phrases to "Self-generated data," such as "experimentally obtained data," to clarify that the author is utilizing data acquired in the lab to validate the tool. (e.g. line 42, 223, and 492).
2. GigaScience 05 Dec 2024
  
  in GigaScience
  
  AbstractBackground With advancements in sequencing and mass spectrometry technologies, multi-omics data can now be easily acquired for understanding complex biological systems. Nevertheless, substantial challenges remain in determining the association between gene-metabolite pairs due to the non-linear and multifactorial interactions within cellular networks. The complexity arises from the interplay of multiple genes and metabolites, often involving feedback loops and time-dependent regulatory mechanisms that are not easily captured by traditional analysis methods.Findings Here, we introduce Compounds And Transcripts Bridge (abbreviated as CAT Bridge, available at https://catbridge.work), a free user-friendly platform for longitudinal multi-omics analysis to efficiently identify transcripts associated with metabolites using time-series omics data. To evaluate the association of gene-metabolite pairs, CAT Bridge is a pioneering work benchmarking a set of statistical methods spanning causality estimation and correlation coefficient calculation for multi-omics analysis. Additionally, CAT Bridge features an artificial intelligence (AI) agent to assist users interpreting the association results.
  
  Reviewer 1: Tara Eicher Reviewer Comments: The authors introduce a useful tool (CAT Bridge) for integrating multiple causal and correlative analyses for multi-omics integration, which also includes a visualization and LLM component. The authors further provide two case studies (human and plant) illustrating the utility of CAT Bridge. I believe that this work should be published, as it contributes to the field of multi-omics analysis.However, I am very concerned about the lack of description regarding the LLM. As explained by Mittelstadt et al (https://www.nature.com/articles/s41562-023-01744-0), LLMs do not always provide factual answers. The authors need to justify the use of the LLM to determine the relevance of a gene-metabolite association. In particular, the authors should add to the main text (or at least the supplementary) a detailed description of the prompt construction and should justify why this prompt is expected to result in factual information. Furthermore, the authors should discuss the caveats of using LLMs in this context, starting with the linked article above. I believe that the manuscript will only be publishable once this concern is addressed.In addition, the authors are recommended to address the following more minor concerns:Implementation:1. Your "example file" links at https://catbridge.work are broken. Please fix this.Abstract:1. Line 32: "Nevertheless, substantial challenges remain in determining the association between gene-metabolite pairs due to the complexity of cellular networks." This is not a clear statement. What about the complexity of cellular networks presents challenges in determining the associations?2. Make sure you are using present tense consistently, not past tense (Line 39).3. Please use the scientific name with the common name in parentheses as follows: Capsicum chinense (chili pepper). Use only the scientific name throughout the rest of the document (Line 41).Background:1. Line 56: "Background" should not be plural.2. Lines 59-60: More comprehensive than what? Please elaborate here.3. In Line 60, please include and familiarize yourself with the following reference: Eicher, T., G. Kinnebrew, A. Patt, K. Spencer, K. Ying, Q. Ma, R. Machiraju and E. A. MathÃ© (2020). "Metabolomics and Multi-Omics Integration: A Survey of Computational Methods and Resources." Metabolites 10: 202.4. Lines 67-68: Citation needed.5. Line 72: Please use the scientific name with the common name in parentheses.6. Lines 74-77: Citations needed.7. Lines 77-78: Give an example of biologically naÃ¯ve conclusions from purely data-driven strategies.8. Line 78: Discuss how the machine learning models could address the drawbacks of the correlation models and vice-versa.Materials and Methods:1. It seems that CAT Bridge needs to be run on one metabolite at a time. In this case, I would not use the term "gene-metabolite pair association" in Line 104, but rather "associations between genes and the target metabolite".2. Line 115: Clearly state which of these methods are non-linear and which address the lag issue.3. Line 136: Your figures are out of order (Figure 1B comes after Figure 2B).4. Please take a look at the Minimum Standards Reporting Checklist (https://academic.oup.com/gigascience/pages/Minimum_Standards_of_Reporting_Checklist). In particular:a. In the section starting at Line 153, list the number of seedlings used.b. Were all timepoints collected from all seedlings? List the total number of samples.c. How many mg were collected per sample (can use a range here)?d. 3 biological replicates per seedling? Give more detail here.e. What machine was used for the ultrasonic process? If frequency settings are permitted by the machine, list the settings used.f. How many of the 28 younger and 54 older adults had both transcriptome and metabolome data?5. Line 209: "Younger" and "older" are better terms.Results:1. Line 248: How does the AI agent analyze the functional annotations?2. Lines 281-282: "This illustrates the advantage of causal relationship modeling methods over traditional methods".3. Line 290: Please also include the updated IntLIM paper (IntLIM 2.0): Eicher, T., K. D. Spencer, J. K. Siddiqui, R. Machiraju and E. A. Mathe (2023). "IntLIM 2.0: identifying multi-omic relationships dependent on discrete or continuous phenotypic measurements." Bioinformatics Advances 3(1): vbad009.4. Make sure the colors are consistent in Table 1.5. Line 156: The scientific name of the pepper species is inconsistent with other areas of the text.Figures:1. S1 should be provided as a table, not a figure.2. Please make S2 larger. It is difficult to read.3. S3 needs labels (x axis, y axis, legend).
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.01.21.576587v3
Nov 2024
www.biorxiv.org www.biorxiv.org

Discovering genotype-phenotype relationships with machine learning and the Visual Physiology Opsin Database (VPOD)

3
1. GigaScience 23 Nov 2024
  
  in GigaScience
  
  AbstractBackground Predicting phenotypes from genetic variation is foundational for fields as diverse as bioengineering and global change biology, highlighting the importance of efficient methods to predict gene functions. Linking genetic changes to phenotypic changes has been a goal of decades of experimental work, especially for some model gene families including light-sensitive opsin proteins. Opsins can be expressed in vitro to measure light absorption parameters, including λmax - the wavelength of maximum absorbance - which strongly affects organismal phenotypes like color vision. Despite extensive research on opsins, the data remain dispersed, uncompiled, and often challenging to access, thereby precluding systematic and comprehensive analyses of the intricate relationships between genotype and phenotype.Results Here, we report a newly compiled database of all heterologously expressed opsin genes with λmax phenotypes called the Visual Physiology Opsin Database (VPOD). VPOD_1.0 contains 864 unique opsin genotypes and corresponding λmax phenotypes collected across all animals from 73 separate publications. We use VPOD data and deepBreaks to show regression-based machine learning (ML) models often reliably predict λmax, account for non-additive effects of mutations on function, and identify functionally critical amino acid sites.Conclusion The ability to reliably predict functions from gene sequences alone using ML will allow robust exploration of molecular-evolutionary patterns governing phenotype, will inform functional and evolutionary connections to an organism’s ecological niche, and may be used more broadly for de-novo protein design. Together, our database, phenotype predictions, and model comparisons lay the groundwork for future research applicable to families of genes with quantifiable and comparable phenotypes.Key PointsWe introduce the Visual Physiology Opsin Database (VPOD_1.0), which includes 864 unique animal opsin genotypes and corresponding λmax phenotypes from 73 separate publications.We demonstrate that regression-based ML models can reliably predict λmax from gene sequence alone, predict non-additive effects of mutations on function, and identify functionally critical amino acid sites.We provide an approach that lays the groundwork for future robust exploration of molecular-evolutionary patterns governing phenotype, with potential broader applications to any family of genes with quantifiable and comparable phenotypes.Competing Interest StatementThe authors have declared no competing interest.
  
  Reviewer 3. Fabio Cortesi
  
  In their manuscript, Frazer et al. developed a machine-learning approach to predict the spectral sensitivity of a visual pigment based on the gene/amino acid sequence of the opsin protein. First, they created a visual opsin database based on heterologously expressed genes from the literature. They then used deepBreaks, an ML tool developed to explore genotype-phenotype associations, to run several different models and test how well ML could predict spectral sensitivity. Their main findings are that the larger the dataset for training and the more diverse (both in opsin sequences themselves and phylogenetic breadth they were derived from) the dataset, the better the predictions will become. However, there is a plateau for the number of training sequences that should be used as a minimum (~ 200), with a slight gain afterwards. As such, the suggested ML approach works well for larger datasets but needs refining for smaller datasets. There are also several drawbacks to the approach that need to be carefully considered when interpreting the results, including the fact that ML cannot accurately predict the effect on phenotype if confronted with a new mutation or a new combination of mutations not used during training.
  
  I found the study to be well-written and easy to follow. The results support the conclusions, and as far as I can tell, the ML and associated analysis were performed accurately. All the code and the database are readily accessible, too. It is great to see that we are at a point now where computational power has reached a level that can be used to predict gene-phenotype relationships accurately. The use of ML to study the function of (visual) opsins, i.e., spectral sensitivity, especially if additional parameters can be included, will undoubtedly be of great help to the field and welcomed by the community. As such, I have no major concerns and only a few minor comments I recommend addressing before publication.
  
  Minor comments
  
  Introduction - Please add a sentence to explain that a visual pigment consists of an opsin protein bound to a chromophore/retinal and that the two units together lead to the 'spectral sensitivity' phenotype. You cover it in the discussion, but it would be helpful for the reader to have this information upfront.
  
  Please provide a reference for the following statement: '[…], and purification of heterologously expressed opsins followed by spectrophotometry [REF]'.
  
  You say, 'Despite opsins being a well-studied system with an extensive backlog of published literature, previous authors expressed doubts that sequence data alone can provide reliable computational predictions of λmax phenotypes [37-40]'.
  
  I agree that the spectral sensitivity predictions from sequences have been criticised in the past as they were sometimes oversimplified (including some of our work). However, spectral sensitivity predictions based on computational modelling, albeit not using ML, have previously been attempted successfully several times, e.g., by Jagdish Suresh Patel and colleagues, and should be mentioned here.
  
  You say that: 'The extensive data on animal opsin genotype-phenotype associations remains disorganized, decentralized, often in non-computer readable formats in older literature, and under-analyzed computationally'.
  
  Again, I agree that the opsin data can profit from a centralised databank like the one you created. However, there have been several previous attempts at summarizing opsin data in recent years (although not specific for heterologously expressed opsins), for vertebrates at least. For example, work by Schweikert and colleagues on fish visual opsins and recent work on frog opsins by Schott et al. These studies should be mentioned and cited appropriately here, as tremendous work went into collating the datasets in the first place.
  
  Results
  
  The use of MWS opsin is somewhat confusing. I presume this refers to vertebrate lws genes that are mid-wavelength shifted? Why have these as a separate group? Ancestrally, there are five sub-families of visual opsin genes in vertebrates: sws1 & sws2 (SWS), rh1, rh2 and lws (MWS & LWS). The MWS range in Figure 1 should be part of a larger lws derived grouping.
  
  This part reads like a discussion. It also needs a reference for the age of T1 opsins: 'The similar levels of performances between T1 and invertebrate models were unexpected, especially considering it has a training dataset five times larger than the invertebrate model. One possible explanation is that the very old age of T1 opsins [REF] might have led to a higher complexity of genotype-phenotype associations that are not yet well sampled enough to allow good predictions.'
  
  These two sentences could also be weaved into the discussion rather than the results section: 'These equations do not account directly for taxonomic, genetic, or phenotypic diversity, as the number of genes is on the x-axis. Therefore, one should be cautious about applying them to predict model performance based on training data size alone.'
  
  Table 1: What do MAPE and RMSE stand for, and what do those numbers mean? Maybe also include a short explanation of the acronyms and their meaning in the main body of the text.
  
  This should also be mentioned in the discussion: 'Until the models are trained with more invertebrate (r-opsin) data, we do not put high confidence in the estimates of λmax.'
  
  Figure 2 legend: Third line, why 'Mutant predictions …'? Aren't the predictions for all sequences?
  
  Figure 3 legend: It says 547 mutant sequences here and 546 sequences in Table 1.
  
  Provide a reference for the following sentence: 'The WT SWS/UVS model similarly highlighted p113, a site functionally characterized as the counterion in the retinal-opsin Schiff base interaction for all vertebrate opsins.
  
  Figure 4 legend: Please provide references for the following sentence: 'Positions 181, 261 and 308 are highlighted because they are among the highest scoring sites and have all been previously characterized as functionally important to opsin phenotype and function.'
  
  Discussion
  
  Please simplify and do not overstate the first sentence. I suggest: 'To better understand methods to connect genes and their functions, we initiated VPOD, a database of opsin genes and corresponding spectral sensitivity phenotypes.'
  
  Section: The important relationship between data availability and predictive power.
  
  You mention that ML could not accurately predict spectral sensitivity if mutant genes were excluded, especially if smaller datasets are used. This was to be expected since ML is not per-se 'smart' but learns from patterns in the underlying dataset. However, it is a significant drawback of the approach, and I encourage you to state this more clearly. My main concern is that future users will take the ML predictions as absolute truth instead of verifying or experimentally verifying the predictions.
  
  Provide a reference for the following sentence: 'One consequence of leaf-based tree construction is that due to its faster convergence/training time, it can be more prone to overfitting, as it constructs trees on a 'best-first basis' with a fixed number of n-terminal nodes.'
  
  You should include some information regarding the assumptions in the Introduction and the Methods section. For example, information about what chromophore interaction was modelled should be in the methods, and the basic information about how visual pigments are formed and what different chromophore types are being used by which species should be in the Introduction: 'We also assume the photopigment uses 11-cis-retinal, as all heterologously expressed opsins in VPOD were reconstituted using this chromophore. However, this assumption is violated in some organisms because they use 13-cis-retinal as the in-vivo chromophore [71-73], which is associated with a red-shift in λmax [32,71].'
  
  Conclusion
  
  I recommend being more cautious about the predictive power for epistatic effects since you tested it only on three samples and the predictions were severely restricted by the training dataset containing the single mutant samples.
2. GigaScience 23 Nov 2024
  
  in GigaScience
  
  AbstractBackground Predicting phenotypes from genetic variation is foundational for fields as diverse as bioengineering and global change biology, highlighting the importance of efficient methods to predict gene functions. Linking genetic changes to phenotypic changes has been a goal of decades of experimental work, especially for some model gene families including light-sensitive opsin proteins. Opsins can be expressed in vitro to measure light absorption parameters, including λmax - the wavelength of maximum absorbance - which strongly affects organismal phenotypes like color vision. Despite extensive research on opsins, the data remain dispersed, uncompiled, and often challenging to access, thereby precluding systematic and comprehensive analyses of the intricate relationships between genotype and phenotype.Results Here, we report a newly compiled database of all heterologously expressed opsin genes with λmax phenotypes called the Visual Physiology Opsin Database (VPOD). VPOD_1.0 contains 864 unique opsin genotypes and corresponding λmax phenotypes collected across all animals from 73 separate publications. We use VPOD data and deepBreaks to show regression-based machine learning (ML) models often reliably predict λmax, account for non-additive effects of mutations on function, and identify functionally critical amino acid sites.Conclusion The ability to reliably predict functions from gene sequences alone using ML will allow robust exploration of molecular-evolutionary patterns governing phenotype, will inform functional and evolutionary connections to an organism’s ecological niche, and may be used more broadly for de-novo protein design. Together, our database, phenotype predictions, and model comparisons lay the groundwork for future research applicable to families of genes with quantifiable and comparable phenotypes.Key PointsWe introduce the Visual Physiology Opsin Database (VPOD_1.0), which includes 864 unique animal opsin genotypes and corresponding λmax phenotypes from 73 separate publications.We demonstrate that regression-based ML models can reliably predict λmax from gene sequence alone, predict non-additive effects of mutations on function, and identify functionally critical amino acid sites.We provide an approach that lays the groundwork for future robust exploration of molecular-evolutionary patterns governing phenotype, with potential broader applications to any family of genes with quantifiable and comparable phenotypes.
  
  Reviewer 2. Nikolai Hecker
  
  The authors compiled a collection of opsin/rhodopsin proteins and their associated light absorption properties from literature; using the measured wavelength of maximum absorbance as a proxy. In addition, they include multiple sequence alignments (MSA) of the proteins including subsets of vertebrate and invertebrate sequences. The data is provided as tab-seperated, comma-seperated, and FASTA files. This is a valuable resource for studying opsins and vision related phenotypes. The authors then use gradient-boosting, random forests, and bayesian ridge regression to predict the wavelength of maximum absorbance from the protein sequence MSAs. Furthermore, they investigate whether their models can be used to identify amino acid changes that impact the wavelength of maximum absorbance and epistasis. This is based on a small set of opsin mutants that have been reported in literature. The manuscript is well structured and written. I have some concerns regarding the analysis, description and presentation of the data.
  
  A traditional cross-validation by random sampling can be inadequate for phylogenetically related sequences. If closely related species are part of the data training and test sets may contain nearly identical sequences. Excluding entire lineages instead of random sequences during training would circumvent this issue.
  
  Based on Fig. 3, Fig. 2, and p. 6, the models do not generalize well given that they only predict mutants well which exhibit a similar wavelength of maximum absorbance as the wild type. Based on the plots (Fig. 2 and Fig. 3) it does not look to me like the model trained on mutants+WT performs substantially better than the WT model for mutants with large wavelength shifts. This would be in contrast to p.16 "Particularly illustrative of these ideas are our analyses with and without experimentally mutated opsins". The authors should either show statistics regarding the improved performance for mutants with large shifts or change the corresponding parts.
  
  The data set description should be more detailed in parts. It should be shown how the opsins/rhodopsins classes (UVS, SWS, MWS, Rhodopsins, LWS) are distributed across the vertebrate and invertebrate phylogeny, for example by a phylogenetic tree and their number per species. Are the mutated opsins/rhodopsins derived from a small set of species or do they reflect most of the vertebrate phylogeny?
  
  How importance scores are estimated for the different models should be explained.
  
  The "ML often predicts the effects of epistatic mutations" section needs some clarifications. Why were only three sequences investigated? Do none of the other double mutants show epistasis when compared with the corresponding single mutations? In this paragraph, it is not always clear whether wavelengths and additive wavelengths are obtained from predictions or actual measurements.
  
  The description in git repository (https://github.com/VisualPhysiologyDB/visual-physiology-opsin-db) is very sparse. The content of the different files and how they relate to each other should at least be briefly explained in a README. It would also be helpful to add gene names, and the source of the sequence to the meta files.
  
  Minor comments
  
  For Fig. 4A, since MSAs are already computed it would be interesting to indicate the conservation amino acids per position. Are important amino acids correlated with sequence conservation?
  
  In Tab. 1, R2 is used to compare different models which are based on different subsets, and also pot. differently sized MSAs. An adjusted R2 might be more suitable to account for different numbers of features.
  
  It would be helpful to add a Docker image to the github repository to make it easier to use.
  
  Re-review: The authors have addressed the majority of my concerns and improved the manuscript. However, there are still some remaining issues that the authors should address before I would recommend the publication of the manuscript.
  
  Dependence between data points is not a novel problem for data analysis and machine learning in a broad range of subjects. While I appreciate that the authors added a paragraph discussing the issue of phylogenetic relatedness, the setup of the cross-validation and how the data is presented make it difficult to assess to which extend their models over-fit to the data. Referring to their previous reply, lineage-/group-based cross-validation should not be arbitrarily chosen but based on the structure of the data. This is not always a trivial problem and magic solution, I agree. The authors should at least incorporate references to literature discussing the problem and potential solutions for dealing with phylogenetic relatedness at p.11 "While these performance metrics are impressive, it is important to remember that phylogenetic relatedness between sequences..." or in the discussion. For example, Roberts et al. provide a nice overview for cross-validation strategies in various settings including phylogenetic data (they call it "block cross-validation"):
  
  Roberts et al. (2017). Cross‐validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40(8), 913-929.
  
  Regarding the comparison between models trained on wild type (WT) and WT + mutant data (WDS), I find the comparison rather difficult to follow concerning repeatably leaving out 25 mutants. For comparing the models, the test set (or test sets) should be the same. This would mean to assess the predictions for the same 25 left out mutants by both the WT and the WDS model (for each 25 left out mutants). If this was done already I would recommend rephrasing the corresponding part in the methods and results to improve the clarity. In addition, a visualization, for instance, a violin plot of the WT model RMSEs vs. violin plot of the WDS model RMSEs would be useful for the readers.
  
  I would still recommed adding a brief summary of how feature importance scores are computed. So, the reader does not have to look up another manuscript. This does not have to be detailed. As I understand, the feature importance is just the normalized number of feature occurrences or the Gini importance for gradient boosting/random forests or the coefficient for regression models.
  
  Minor details:
  
  Fig. S10: the text at the leaves is not readable. It could be replaced, for instance, with the name of the gene family if that make sense, or removed.
  
  Fig. 4A: the bars at position 181, 261, and 308, could be indicated, for example, in red or another color, to easier compare A and B.
3. GigaScience 23 Nov 2024
  
  in GigaScience
  
  Background Predicting phenotypes from genetic variation is foundational for fields as diverse as bioengineering and global change biology, highlighting the importance of efficient methods to predict gene functions. Linking genetic changes to phenotypic changes has been a goal of decades of experimental work, especially for some model gene families including light-sensitive opsin proteins. Opsins can be expressed in vitro to measure light absorption parameters, including λmax - the wavelength of maximum absorbance - which strongly affects organismal phenotypes like color vision. Despite extensive research on opsins, the data remain dispersed, uncompiled, and often challenging to access, thereby precluding systematic and comprehensive analyses of the intricate relationships between genotype and phenotype.Results Here, we report a newly compiled database of all heterologously expressed opsin genes with λmax phenotypes called the Visual Physiology Opsin Database (VPOD). VPOD_1.0 contains 864 unique opsin genotypes and corresponding λmax phenotypes collected across all animals from 73 separate publications. We use VPOD data and deepBreaks to show regression-based machine learning (ML) models often reliably predict λmax, account for non-additive effects of mutations on function, and identify functionally critical amino acid sites.Conclusion The ability to reliably predict functions from gene sequences alone using ML will allow robust exploration of molecular-evolutionary patterns governing phenotype, will inform functional and evolutionary connections to an organism’s ecological niche, and may be used more broadly for de-novo protein design. Together, our database, phenotype predictions, and model comparisons lay the groundwork for future research applicable to families of genes with quantifiable and comparable phenotypes.
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae073). The peer-reviews are as follows.
  
  Reviewer 1. Robert Lucas.
  
  Frazer and colleagues set out to assess the ability of machine learning (ML) approaches to predict spectral sensitivity (lmax) of animal opsins from their amino acid sequence. To this end they first develop a database of phenotyped opsins (opsin sequences with known lmax), which they term Vpod1. They then explore how various factors of the ML process impact its ability to predict lmax. These include the nature of the input training dataset (size, phylogenetic and gene family diversity, inclusion of data from mutagenesis experiments) and the ML method. For comparison they include a phylogenetic imputation approach that predicts lmax based upon overall sequence similarity. They test the validity of their approach according to their ML pipelines' ability to predict: lmax for the training dataset; the outcome of mutagenesis; lmax for a test dataset extracted from the training dataset; known epistatic interactions; and established spectral tuning sites. In all cases, they report various degrees of success and conclude that the ML approach can be used to predict lmax (almost as well as phylogenetic imputation but with reduced computational cost) provided that the training dataset is sufficiently rich (it performs poorly for invertebrate opsins for which data are limited) and, ideally, benefits from mutagenesis datasets.
  
  I am no expert in machine learning and will leave others to comment on that aspect of methodology but in general this study represents an interesting addition to the literature. The idea of predicting lmax from amino acid sequence is not new e.g. as the authors acknowledge the '5 sites rule' for cone pigments is long established. Applying ML holds the promise of a more efficient process for achieving similar predictability for other branches of the animal opsin family. In that regard, the inherent limitation in the ML approach is highlighted - it is particularly valuable in branches of the family for which information is sparse (invertebrate opsins), but performs poorly in those branches without more starting information about structure:function relationships (which itself replaces the need for ML to some extent). Nonetheless, it certainly has the potential to be a valuable tool and this paper represents a sound exploration of its characteristics and one important feature of the paper is that it confirms that ML can allow fairly good predictions based solely on data from wildtype opsin sequences.
  
  I have relatively few suggestions for improvement. The most important is that the authors appear to have omitted one technology for the process of defining lmax (introduction, method and discussion). We and others have used heterologous action spectroscopy to describe lmax for a growing number of animal opsins. In this technique spectral sensitivity is defined using live cell assays of light response for opsins expressed in immortalised cell lines. Those data could be included in the Vpod1 dataset. It would also be appropriate to mention the approach as a tool for populating the training dataset as it has the advantage of being applicable to opsins that don't reliably form pigments in vitro (e.g. many invertebrate opsins) and does not rely on access to the animal itself but only to its genome sequence. The authors also may wish to relate Vpod1 to another recently published database of animal spectral sensitivities albeit collected for a different purpose (https://doi.org/10.1016/j.baae.2023.09.002).
  
  Some minor points. The authors note with surprise that ML performed poorly for the rod opsin dataset. Could this be because their metric (R2) is sensitive to the degree of variability in lmax in the training dataset, which is constrained in rod opsins? I found the pastel colours in Fig2 and 3 hard to discern, more separation on the colour pallet would be appreciated.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.02.12.579993v1
www.biorxiv.org www.biorxiv.org

CoCoPyE: feature engineering for learning and prediction of genome quality indices

2
1. GigaScience 23 Nov 2024
  
  in GigaScience
  
  AbstractThe exploration of the microbial world has been greatly advanced by the reconstruction of genomes from metagenomic sequence data. However, the rapidly increasing number of metagenome-assembled genomes has also resulted in a wide variation in data quality. It is therefore essential to quantify the achieved completeness and possible contamination of a reconstructed genome before it is used in subsequent analyses. The classical approach for the estimation of quality indices solely relies on a relatively small number of universal single copy genes. Recent tools try to extend the genomic coverage of estimates for an increased accuracy. CoCoPyE is a fast tool based on a novel two-stage feature extraction and transformation scheme. First it identifies genomic markers and then refines the marker-based estimates with a machine learning approach. In our simulation studies, CoCoPyE showed a more accurate prediction of quality indices than the existing tools.
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae079). The peer-reviews are as follows.
  
  Reviewer 1. Xiaoquan Su
  
  In this work, authors proposed CoCoPyE to evaluate the genome quality constructed from metagenomes by a two-stage approach. In general, this work is valuable for the research works in this field, and some issues should be addressed before further consideration for publication.
  
  In section 2.1, how the threshold of 60% and 30% were determined?
  
  In the 2.1.6 section, there were two different prediction method, including linear and non-linear prediction, so in actual senses, how to choose a proper way?
  
  For the simulation, I also suggest to make some simulation for specific habitat metagenomes, e.g. human-associated habitats (gut, oral, etc.), or natural environments (soil, marine).
  
  For the online demo at https://cocopye.uni-goettingen.de/, a demo fasta input file can be useful for quick startup.
  
  Re-review: All my previous comments have been addressed and I have no more question.
2. GigaScience 23 Nov 2024
  
  in GigaScience
  
  The exploration of the microbial world has been greatly advanced by the reconstruction of genomes from metagenomic sequence data. However, the rapidly increasing number of metagenome-assembled genomes has also resulted in a wide variation in data quality. It is therefore essential to quantify the achieved completeness and possible contamination of a reconstructed genome before it is used in subsequent analyses. The classical approach for the estimation of quality indices solely relies on a relatively small number of universal single copy genes. Recent tools try to extend the genomic coverage of estimates for an increased accuracy. CoCoPyE is a fast tool based on a novel two-stage feature extraction and transformation scheme. First it identifies genomic markers and then refines the marker-based estimates with a machine learning approach. In our simulation studies, CoCoPyE showed a more accurate prediction of quality indices than the existing tools.Competing Interest StatementThe authors have declared no competing interest.
  
  Reviewer 2. Robert Finn
  
  The paper by Birth et al describes CoCoPy, a two stage pipeline for the estimation of completeness and contamination of prokaryotic genomes, especially for the assessment of metagenome assembled genomes. The paper was well written and clearly outlined the aims of the software, the approach and the need for a two stage process. I also appreciate the candid nature of the discussion that CoCoPy should be considered as complementary to CheckM2. The performance in terms of time is a notable consideration why this tool should be considered by the field, and the benchmarks of completeness and contamination are encouraging. The main drawback of the tool is the need for a close reference genome for the second stage quality estimation, which will limit use for environmental metagenomics.
  
  Major comments
  
  While I appreciate the benefits in terms of speed offered by UProC, there are a number of questions that are not adequately addressed in the manuscript. The first is why the version of Pfam is so out of date, with version 24 and 28 being used in the feature classification. The authors remarked about the improvement between Pfam 24 and 28. Pfam is now on version 36, with a release produced about once a year. During this time, the Pfam entries have been expanded in number, increased in sequenced diversity and optimised in terms of boundaries. This is particularly pertinent now, with the use of AlphaFold models improving domain boundaries. Secondly, Pfam models have per model thresholds, but there was no discussion of thresholds used. Finally, Pfam Clans were introduced in Pfam version 18.0, as a way of modelling diverse families with multiple profile HMMs. While many of these families are unlikely to represent single copy marker genes, there is still the case that two families belonging to a same clan could be measured as a dissimilarity, when actually they are representing the same protein family. This is particularly important in the marker based estimates and count histogram ratio.
  
  It would also be beneficial for the reader to see the results from genomes simulate with a fragmentation profile that more closely represents that of MAGs, where there may be a few long contigs in the 100kbp range, and then quickly tails off to contigs in the 1000s bp range. Also, the authors should try and estimate the amount of blind contamination, i.e. contigs that have no single marker genes. This is an important metric which is typically overlooked by current tools. This particularly applies to those MAGs where they fail to be passed on to the second phase of contamination.
  
  The second stage of the CoCoPy should also be benchmarked against tools such as GUNC, which similarly uses features from reference genomes to estimate completeness and contamination. This would help guide the reader to understanding whether running CheckM2 with GUNC or CoCoPy would be advantageous.
  
  Minor In the introduction, the authors omit the part of the MIMAG standard that requires the presence of tRNAs and SSU/LSU also need to be present to refer to the genome as high quality, not simply based on completeness and contamination.
  
  In the "Reference database" section it would be informative to know the number of Pfam entries (and their accessions) that are considered single copy marker genes. Also, the best concept of completeness is having a closed, circular genome in RefSeq.
  
  In the construction of the test data it would be useful to provide a measure of taxonomic distance between the genomes in the training dataset and the test dataset. While this is difficult, a basic metric such as average branch length to nearest neighbour, or number of steps away from the nearest neighbour in the GTDB taxonomic tree, but some level of information would be informative, rather than simply not being the same taxID.
  
  How sensitive is the second stage to completeness? Conceivably, the use of MAGs to enrich the sequence space could improve the second stage, if strict completeness and contamination rules were applied?
  
  Re-review: I appreciate the authors effort in trying to update the version of the Pfam database. It is disappointing that new versions of the resource are not being considered down to a technical problem in the UProC implementation. While it is highlighted that the older versions of Pfam provide computational advantages, there are many potential solutions that could be found to overcome this, and the realms of memory requirements are not vastly different to CheckM1. While I understand the difference between HMMER and UProC, it is as much about the improvements in domain boundaries and increased coverage of sequence space that the more recent versions harbour that I would expect to improve the performance. The advent of AlphaFold has resulting is a large number of improvements to the Pfam boundaries. As the authors have a fix, it is slightly strange they have not included the results in the response to the reviewers comments. The fix may be limited in scope and not officially release, but it would be more convincing to show the results of Pfam v36 against v24/28, thus allowing an informed judgement. I look forward to the release of an updated version of UProC in the near future, as promised by the authors.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.02.07.579156v2
www.biorxiv.org www.biorxiv.org

spatiAlign: An Unsupervised Contrastive Learning Model for Data Integration of Spatially Resolved Transcriptomics

1
1. GigaScience 23 Nov 2024
  
  in GigaScience
  
  Competing Interest StatementThe authors have declared no competing interest.
  
  Reviewer 3. Jose Fernandez Navarro
  
  The authors present a novel computational method to integrate SRT datasets claiming that the method adjusts for batch effects while retaining the biological differences. The method provides the possibility to adjust the gene expression counts to be used for downstream analysis. The method was benchmarked against other methods that are available for integration of single cell and spatial transcriptomics datasets obtaining positive results. The manuscript is well structured and clear, it provides a robust motivation and the comparisons with other methods are clear and well defined. The method has the potential to make a contribution to the field, specially considering that it has been developed to be compatible with scanpy and that an open-source library has been made available on GitHub.
  
  Introduction:- In the following sentence: "batch effects caused by nonbiological factors such as technology differences and different experimental batches." the authors could have elaborated more and perhaps included some references.- In the following sentence: "In contrast, popular MNN-based methods such as Seurat v3[16] efficiently address batch effects in gene expression, but their limitation lies in the ability to align only two batches at a time, and they become impractical when dealing with many batches" I do not think the MNN-based term is correct in that context. Also, I do not entirely agree in the claim. One generally does not have many batches to correct for and the referred methods can perform batch correction in datasets with more than 2 batches.- In the following statement: "However, PRECAST only returns the corrected embedding space, and GraphST requires registering the spatial coordinates of samples first to ensure its integration performance; thus, their applications are limited in certain scenarios. "I'm not in total agreement, I understand PRECAST provides a module to obtain corrected gene expression counts for downstream analysis. Results:- I find the introduction to spatiAlign a bit long. It could perhaps be simplified and then leave the implementation details to the Methods section.- In the following sentence: "..spatial neighbouring graphs between cells/spots (e.g., cellâ€’cell adjacent matrix A), where the connective relationships of cells/spots are negatively associated with Euclidean distance." I find it a bit misleading, are the authors building the spatial graph using a fixed radius? Or euclidean distances in a manifold?- I could not find a detailed description on how the different datasets were processed with the others methods that they used to benchmark.- I believe to measure the power of the methods to retain biological differences, comparing consecutive sections of the same tissueis not enough. I would also include a comparison using sections from different individuals (same region).- In the MOB datasets comparison, by looking at the UMAP figures, the differences in performance it is not so clear in the cases of SCALED and BBKNN.In the Hippocampus dataset, I did not see information on how the clusters were annotated. It would have been nice to include the ABA figures of the same region. I found it difficult to understand the basis and interpretation of the spatial autocorrelation analysis with Moran's I. In the MOB embryo dataset, did the authors consider include a comparison with the other methods? Figures:I observed some of the supplementary figures are out of order or the labels do not match the panels, I encourage the authors to revise this. I also noticed some of the panels showing expression plots do not have a bar with the range of expression. The labels in some of the panels are hard to read and I miss some labels (f.e. the section/dataset in some of the panels).Some figures make reference to the ABA and/or the tissue morphology. For these, I could suggest including the HE images and/or IF images from the ABA. Figure 2a-c: the fonts are hard to read. Figure 2d is hard to read, perhaps the layout would be better by making it one column per method?. Figure 3g would be easier to read if the 3 datasets were arranged side by side. Figure S4, I find the clusters hard to see clearly.
  
  Datasets and documentation: The authors provide links to the original datasets but they do not provide access to the processed and annotated datasets, this makes it difficult to replicate the results and the examples provided in the documentation. The manuscript would benefit if the authors would provide better documentation and means to reproduce/replicate the analyses.
  
  Software: I was able to install the package with PyPy in a Conda environment but I had to manually install some dependencies to make it work.Major comments:- I would like to suggest the authors to revise the figures. The supplementary figures descriptions do not seem to match the content of the figures. Some of the figures are missing labels and color bars.- I would like to suggest the authors to correct for grammar and misspelling errors and perform a throughout proof reading of the manuscript for consistency.- I would like the authors to provide links to access the processed/annotated datasets.- I would like the authors to provide more details on how the datasets were processed with their method and the others method (hyperparameters, versions, etc..). This could be complemented greatly if the authors could provide notebooks or step-by-step documentation.- I would like to suggest the authors to include a comparison with true biological differences such as different phenotypes and/or genotypes.- I would like to suggest the authors to include some of other methods in the MOB (stereo-seq) comparison.- I would like to suggest the authors to check their claim that PRECAST does not provide "corrected" gene counts or that the other methods do not provide means to perform downstream analyses (DEG, trajectory inference, etc…).- I would like to suggest the authors to include normalized counts as well as raw counts in some of the comparisons (for example when performing the trajectory analysis or showing the spatial distribution of features). Minor comments:- I would like to suggest the authors to not use the term "expression enhacenment", to me the gene expression is corrected or adjusted but not enhanced.- I would like to suggest the authors to improve the documentation of the open-source package to provide more information on the different arguments and options. It would also be nice to provide documentation and/or notebooks to reproduce the analysis (or some) presented in the manuscript.- I would like to suggest the authors to improve the installation of the PyPy package since some dependencies seem to be missing.- I would like to suggest the authors to improve the layouts and font size of some of the for clarity and readability.
  
  Re-review: I acknowledge the efforts made by the authors to address the comments and provide answers. However, I still find the manuscript not ready for publication. These are my comments: Major:- The authors have included a new analysis (sup. figure 7) using a dataset (tumor liver) that lacks a stereotypical structure. While this is a good addition to the manuscript, I would still like to see the performance of spatiAlign in correcting technicaleffects while retaining true biological differences (f.e. disease and control). In addiction to this, a comparison using a imaging-based technology (f.e Merfish or CosMx) would make the manuscript stronger.- The authors have made an effort to provide Jupyter notebooks with code to reproduce the analyses. Unfortunately, this is uncompleted. None of the notebooks contain code to reproduce the spatiAlign analyses and only the notebook with the tumor liver dataset (sup. figure 7)includes the processing steps. For the other datasets they authors use hard-coded values. Moreover, I was unable to run some of the notebooks due to errors and missing files and/or dependencies. The authors should provide one notebook for each dataset including the processing and analysis and provide means to run the notebooks (environment files and/or docker files) in an easy way that enables reproduciblity. Ideally, these notebooks should also include the spatiAlign analysis.- I observed a strange effect in figure 2 where the UMAP manifolds of the BBKNN, Harmony and Combat are similar. I could identify the error causing this in one of the notebooks. I strongly suggest the authors to revise all the analyses and figures and to provide notebooks to reproduce these in an easy way as I mentioned before.- I find the MNN performance surprisingly bad. I wonder if this could be due to how the data was processed with this method. Did the authorstry to disable cosine normalization for the output?.
  
  Minor:- I think the manuscript would be stronger if the authors would include the normalized counts in the figures where they show the raw counts.- I still find inconstancies in the text (typos, grammatical and syntactical errors). The authors are still using the term enhanced (specially in figure legends).- In the MOB dataset, the authors claim that the Visium spots are 100mm but that cannot be true, visium spots are 50mm.- In figure 3 (panel f) use the same layout as figure 2 for consistency.- In figure 4 (panel g) the color bar and labels are missing.- In Sup. figure 3 (panel c) the color bar is out of place and the legend is missing.
  
  Re-review: The authors have made a great effort to improve the manuscript. The improvements on the documentation and open-source package will be appreciated by the community. I only have minor comments:- The grammar has improved but I could still see some errors (to cite a few):- line 96 "dimensional reduction"- line 346 "structure and MERFISH"- I still think that the authors have not been able to fully demonstrate the performance of their method to integrate datasets with true biological/phenotypical differences (f.e. disease and healthy). Supplementary figures 7 and 8 add value of the manuscript by integrating tumor cells from different patients but this is not exactly what reviewer 1 and Isuggested. I acknowledge the explanations that the authors provide in their response but I'm not in total agreementwith the statements. There are publicly available datasets that could suit this analysis. I will not request to amend such analysis to the manuscript but I could at least suggest to mention this in the manuscript as a limitation or future work.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.08.08.552402v2
www.biorxiv.org www.biorxiv.org

StereoSiTE: A framework to spatially and quantitatively profile the cellular neighborhood organized iTME

3
1. GigaScience 23 Nov 2024
  
  in GigaScience
  
  With emerging of Spatial Transcriptomics (ST) technology, a powerful algorithmic framework to quantitatively evaluate the active cell-cell interactions in the bio-function associated iTME unit will pave the ways to understand the mechanism underlying tumor biology. This study provides the StereoSiTE incorporating open source bioinformatics tools with the self-developed algorithm, SCII, to dissect a cellular neighborhood (CN) organized iTME based on cellular compositions, and to accurately infer the functional cell-cell communications with quantitatively defined interaction intensity in ST data. We applied StereoSiTE to deeply decode ST data of the xenograft models receiving immunoagonist. Results demonstrated that the neutrophils dominated CN5 might attribute to iTME remodeling after treatment. To be noted, SCII analyzed the spatially resolved interaction intensity inferring a neutrophil leading communication network which was proved to actively function by analysis of Transcriptional Factor Regulon and Protein-Protein Interaction. Altogether, StereoSiTE is a promising framework for ST data to spatially reveal tumoribiology mechanisms.Competing Interest StatementThe authors have declared no competing interest.
  
  Reviewer 3. Rosalba Giugno
  
  Authors introduce StereoSiTE, which integrates open-source bioinformatics tools with the self-developed algorithm SCII. The aim is to dissect a cellular neighborhood (CN) organized iTME based on cellular compositions and accurately infer functional cell-cell communications with quantitatively defined interaction intensity in ST data.
  
  The paper's objective is commendable, and the overall organization of the content, along with the obtained results, holds great promise. Nevertheless, certain aspects need to be addressed. The proposed approach's novelty is significantly anchored in the SCII software. However, the paper has notable drawbacks. It falls short in providing a theoretical and scientific comparison with other similar tools. Moreover, the comparison includes systems that do not incorporate spatial considerations, posing a limitation in assessing the method's uniqueness in a broader context.
  
  Give more details on which systems are you referring here: "To improve accuracy, we recommended using spatially resolved data at single cell resolution". Please provide your insights on the rationale for employing or abstaining from downstream analysis to comprehend the spatial distribution of gene expression in tissue, as https://doi.org/10.1093/gigascience/giac075 and https://doi.org/10.1038/s41467-023-36796-3. Additionally, consider discussing how this is associated with the prediction, validation of the functional enrichment or on step: Clustering bins into different cellular neighborhoods based on their cellular composition.
  
  Re-reviews The authors have solved my issues.
2. GigaScience 23 Nov 2024
  
  in GigaScience
  
  AbstractWith emerging of Spatial Transcriptomics (ST) technology, a powerful algorithmic framework to quantitatively evaluate the active cell-cell interactions in the bio-function associated iTME unit will pave the ways to understand the mechanism underlying tumor biology. This study provides the StereoSiTE incorporating open source bioinformatics tools with the self-developed algorithm, SCII, to dissect a cellular neighborhood (CN) organized iTME based on cellular compositions, and to accurately infer the functional cell-cell communications with quantitatively defined interaction intensity in ST data. We applied StereoSiTE to deeply decode ST data of the xenograft models receiving immunoagonist. Results demonstrated that the neutrophils dominated CN5 might attribute to iTME remodeling after treatment. To be noted, SCII analyzed the spatially resolved interaction intensity inferring a neutrophil leading communication network which was proved to actively function by analysis of Transcriptional Factor Regulon and Protein-Protein Interaction. Altogether, StereoSiTE is a promising framework for ST data to spatially reveal tumoribiology mechanisms.Competing Interest Statement
  
  Reviewer 2. Chenfei Wang
  
  In this manuscript, Xin. et al. provided a framework called StereoSiTE that incorporated the established methodologies with their developed algorithm to defined cellular neighborhood (CN) organized immune tumor microenvironment (iTME) based on cellular compositions, and to dissected the spatial cell interaction intensity (SCII) in spatial transcriptomics (ST). StereoSiTE has the following improvements compared to existing methods. First, SCII detects cell-cell communication using both cell space nearest neighbor graph and targeted L-R expression. Second, SCII taken the interaction distance account for different interaction classification such as secreted signaling, ECM receptor and cell-cell contact. Finally, StereoSiTE could avoided to detected the false positive interactions caused by limited reachable interaction.
  
  Although the authors performed comprehensive works to demonstrate the potential applications of StereoSiTE. This reviewer has strong concerns about the potential novelty and effectiveness of StereoSiTE over existing methods. The CN results were not mindful of the spatial information, and the labeled cellular neighborhood (CN) may mislead users. Additionally, although the L-R pair could be categorized into three classifications based on interaction distance, the SCII only uses different radius to infer cell communication without employing a different strategy for predicting interactions in distinct L-R pairs. I have the following detailed comments.
  
  Comments： 1. The authors fail to show the novelty and advantages of CN compared to other methods, such as DeepST, which integrates gene expression, spatial location and image information. The authors should provide the comparison with the several recent strategies that consider the effect of local niches including BANKSY, stLearn, Giott, and DeepST. 2. The authors should compare SCII with additional methods such as CellPhoneDB v3 and Cellchat v2, demonstrating its superior performance. 3. The method used for cell segmentation should offer more comprehensive information rather than solely citing "Li, M. et al. (2023)". 4. Format of the paper. The alignment inconsistency within the manuscript—with some paragraphs centered and others justified—should be corrected for uniformity. 5. The figures and manuscript containing 'Teff' and 'M2-like' cell types should provide a legend explaining the abbreviations for clarity. 6. The font size of the labels in Figures 5E-F is insufficient for easy reading and should be enlarged. Re-review: In the response letter, the author emphasizes the novelties of the StereoSiTE framework and demonstrates how the StereoSiTE software was specifically designed to address the question of "how iTME responds and functions under stimulation" using stereo-seq data. The author highlights notable enhancements to the self-development algorithm, including CN and SCII. The CN algorithm focuses on evaluating the cell composition in iTME, while SCII is designed to infer the intensity of spatial cell interactions. These advancements have been incorporated into the updated version of the manuscript. Notably, the SCII component of the framework combines spatial information and expression patterns to infer that cell-cell communication can limit reachable interactions, thereby reducing false positive interactions. The authors have also employed distinct strategies to predict different types of L-R pairs with varying interaction distances, encompassing secreted signaling, ECM-receptor, and cell-cell contact. In the case of secreted type L-R pairs, SCII enables the specification of varying radius thresholds to infer spatial cell communication. However, it is recommended that the authors consider the exponential decay of expression values, particularly when the radius exceeds 100 μm.
  
  The response also outlines the authors' claim that CN exhibits good performance compared to other tissue domain division methods (BANKSY and Giotto HMRF). However, upon reviewing the performance comparison results, it becomes apparent that BANKSY outperforms the other methods, although the CN method shows nearly consistent performance with BANKSY on the benchmark dataset STARmap. To substantiate the preference for CN over BANKSY, the authors are encouraged to provide evidence of its user-friendly interface, shorter run time, or lower memory usage. Overall, the revisions and enhancements made to the StereoSiTE framework significantly improve its functionality and analytical capabilities. The StereoSiTE software holds great promise in providing invaluable insights and support for potential users and researchers in the field.
3. GigaScience 23 Nov 2024
  
  in GigaScience
  
  AbstractWith emerging of Spatial Transcriptomics (ST) technology, a powerful algorithmic framework to quantitatively evaluate the active cell-cell interactions in the bio-function associated iTME unit will pave the ways to understand the mechanism underlying tumor biology. This study provides the StereoSiTE incorporating open source bioinformatics tools with the self-developed algorithm, SCII, to dissect a cellular neighborhood (CN) organized iTME based on cellular compositions, and to accurately infer the functional cell-cell communications with quantitatively defined interaction intensity in ST data. We applied StereoSiTE to deeply decode ST data of the xenograft models receiving immunoagonist. Results demonstrated that the neutrophils dominated CN5 might attribute to iTME remodeling after treatment. To be noted, SCII analyzed the spatially resolved interaction intensity inferring a neutrophil leading communication network which was proved to actively function by analysis of Transcriptional Factor Regulon and Protein-Protein Interaction. Altogether, StereoSiTE is a promising framework for ST data to spatially reveal tumoribiology mechanisms.
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae078), and published as part of our Spatial Omics Methods series. The peer-reviews are as follows.
  
  Reviewer 1. Lihong Peng
  
  In this manuscript, the authors developed a computational framework named StereoSiTE to spatially and quantitatively profile the cellular neighborhood organized iTME by incorporating open source bioinformatics tools with their self-proposed algorithm named SCII. This study is very meaningful. However, it remains several problems.
  
  Major comments: 1. The authors incorporated several open sources bioinformatics tools. However, how to ensure their combination is the optimal to the spatially resolved cell-cell communication inference performance? For example, cell2location was used to deconvolute cellular composition and construct cellular neighborhood. Why to use cell2location for deconvoluting spatial transcriptomics data? why not use the newest deconvolution algorithms, for example, SpaDecon, Celloscope, POLARIS, GraphST, SPASCER, and EnDecon? No model can adapt to all data. The authors should first verify that cell2location is the best appropriate cell type annotation tool corresponding to iTME. If not, the subsequent analyses will be not appropriate.
  
  The authors claimed that they computed the decomposition losses of different combinations of the number of CN modules and CT modules. Which combinations? The authors should list them.
  
  When measuring spatial cell interaction intensity, the authors only simply summed up the ligand and receptor gene expression information of the sender and receiver cells. Why not consider existing classical intercellular communication intensity methods? The authors should compare other intercellular communication intensity measurement methods. Please refer to the following two cites: Cell-cell communication inference and analysis in the tumour microenvironments from single-cell transcriptomics: data resources and computational strategies, briefings in bioinformatics. CellDialog: A Computational Framework for Ligand-receptor-mediated Cell-cell Communication Analysis, IEEE Journal of Biomedical and Health Informatics. Deciphering ligand-receptor-mediated intercellular communication based on ensemble deep learning and the joint scoring strategy from single-cell transcriptomic data, Computers in Biology and Medicine.
  
  For protein-protein interaction analysis, the authors queried 628 significant up regulated genes in CN5 area of treatment samples from STRING. Can all obtained proteins be ligands or receptors? In addition, they labeled hub genes and key protein-protein interaction networks, what were these hub genes and key networks used for?
  
  Which ligand-receptor pairs could mediate intercellular communication within immune tumor microenvironment? Among these L-R pairs, which L-R pairs are known in existing databases and which L-R pairs are the predicted ones?
  
  "The enrichment analysis of individual CN showed that each CN had a dominant cell type with a spatial aggregation (Fig 2F), which was increasingly obvious than that in whole slide (Fig 2E)." What's a dominant cell type? How to define it?
  
  "To reduce the variance among open-sourced L-R databases, we unified L-R database in SCII by choosing L-R dataset in CellChatDB, which assigned each L-R with an interaction distance associated classification as secreted signaling, ECM receptor and cell-cell contact." How to unify L-R database? Did it allow for user-specified LR databases and/or add user-specified LR databases?
  
  In figure 3, how to confirm which L-R pairs mediate intercellular communication?
  
  StereoSiTE is composed of multiple modules, is it scalable? Can some of these modules (such as clustering and cell type annotation) be replaced with other more powerful modules?
  
  The authors claimed that "CellPhoneDB detected many false positive interactions". How to find these false positive LRIs? How to validate the LRIs be false positives? Please list the found false positive LRIs.
  
  In Figure 3, the authors should add comparison experiments between StereoSiTME and classical intercellular communication analysis tools.
  
  Minor comments: 1. The text in subfigure A, B, and C in Supplementary Figure 2 is obscure. The authors should revise Supplementary Figure 2. 2. In Section "Abstract", iTME should use full name when it first appears. 3. Which cites of "13 Li, M. et al. (2023)." is in the reference list?
  
  Re-review:
  
  In the revised manuscript, the authors conducted lots of revisions. However, it still remains many problems to solve:
  
  The authors have compared the performance of Cell2location with other cell type identification methods, Celloscope[10], GraphST[11], and POLARIS[12] on on both STARmap and stereo-seq dataset of liver cancer. How about its performance on other unlabeled datasets? Please compare it with "STGNNks: Identifying cell types in spatial transcriptomics data based on graph neural network, denoising auto-encoder, and 𝑘-sums clustering".
  
  Cell-cell communication is usually mediated by LRIs. The construction of high-quality LRI databases is very important to cell-cell communication. The authors should introduce these LRI data resources and potential LRI prediction methods and cite them, for example, PMID: 37976192, 37364528, 38367445.
  
  In Figure 4B, 4C, 4D, and 4F, Figure 5A and 5B, Figure 6B and 6C, the fonts are too small. Please enlarge the fonts.
  
  The organization and structure of this manuscript must be carefully revised. For example, The structure in Discussion is obscure. In the first paragraph in this section, the authors have introduced their proposed method, next, they described it in details. But the third paragraph elucidated the reason why to develop this reason. In addition, "Figure 3 highlights that the analysis without distance threshold may lead to false positive results, and SCII showed more superior performance than other methods." why to Figure 3? Did not the other results support their conclusion? The final paragraph in Discussion introduced their method again. It HAS NO logic.
  
  Where is the conclusion of this manuscript?
  
  The authors should analyze the limitations of this work for further work in the future.
  
  English is VERY POOR. This manuscript must be carefully revised. For example,
  
  "prove that spatial proximity is a must to guarantee an effective investigation.", is a must to do?
  
  Re-re-review: The authors have solved my issues.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.12.31.522366v3
www.biorxiv.org www.biorxiv.org

A data standard for the reuse and reproducibility of any stable isotope probing-derived nucleic acid sequence (MISIP)

3
1. GigaScience 23 Nov 2024
  
  in GigaScience
  
  DNA/RNA-stable isotope probing (SIP) is a powerful tool to link in situ microbial activity to sequencing data. Every SIP dataset captures distinct information about microbial community metabolism, kinetics, and population dynamics, offering novel insights according to diverse research questions. Data re-use maximizes the information available from the time and resource intensive SIP experimental approach. Yet, a review of publicly available SIP sequencing metadata reveals that critical information necessary for reproducibility and reuse is often missing. Here, we outline the Minimum Information for any Stable Isotope Probing Sequence (MISIP) according to the Minimum Information for any (x) Sequence (MIxS) data standard framework and include examples of MISIP reporting for common SIP approaches. Our objectives are to expand the capacity of MIxS to accommodate SIP-specific metadata and guide SIP users in metadata collection when planning and reporting an experiment. The MISIP standard requires five metadata fields: isotope, isotopolog, isotopolog label and approach, and gradient position, and recommends several fields that represent best practices in acquiring and reporting SIP sequencing data (ex. gradient density and nucleic acid amount). The standard is intended to be used in concert with other MIxS checklists to comprehensively describe the origin of sequence data, such as for marker genes (MISIP-MIMARKS) or metagenomes (MISIP-MIMS), in combination with metadata required by an environmental extension (e.g., soil). The adoption of the proposed data standard will assure the reproducibility and reuse of any sequence derived from a SIP experiment and, by extension, deepen understanding of in situ biogeochemical processes and microbial ecology.Competing Interest StatementThe authors have declared no competing interest.
  
  Reviewer 3. Xiaoxu Sun
  
  The paper titled "MISIP: A Data Standard for the Reuse and Reproducibility of Stable Isotope Probing Derived Nucleic Acid Sequence and Experiment" presents a compelling argument for establishing a minimum information standard for stable isotope probing (SIP) experiments. The proposed MISIP standard aims to facilitate data reuse and ensure the reproducibility of results within the scientific community. The authors have meticulously considered the essential information required for MISIP, resulting in a well-articulated manuscript. However, I have a few suggestions that could further refine the proposed standard.
  
  To me, one critical aspect of MISIP is to ensure it provides necessary details of the SIP incubations. Although the authors have integrated some of this information, which can overlap with other existing standards like MIMS/MIMARKS (e.g., sample origin), there are additional elements that should be included in MISIP, either as mandatory or recommended information.
  
  Suggestion 1: Inclusion of Additional Substrates in Incubations
  
  The paper rightly identifies the isotopologue as a requisite detail for MISIP. However, I recommend expanding this requirement to include a mention of other substrates added during incubations, at least as a recommended piece of information. While specifying the primary substrate (e.g., 13C-labeled glucose) is often sufficient for studies targeting heterotrophic processes, the identification of autotrophic populations using substrates like 13C-bicarbonate necessitates the disclosure of electron donors/acceptors to clarify the targeted metabolic processes.
  
  Suggestion 2: Detailed Reporting of Incubation Progress
  
  Although incubation time is suggested as a recommended field, I propose that details regarding the progress of the specified reactions should also be documented, such as the incorporated dose. This is particularly relevant when different substrate doses are used, as these can yield varied outcomes. For instance, the rate of substrate utilization can significantly differ across inoculums at identical time points; coastal sediment might consume 1 mM of glucose in a day, whereas deep-sea samples might take longer. Therefore, merely reporting incubation time without context may not provide sufficient insight for readers to gauge the dynamics of potential cross-feeding or other relevant processes.
  
  In conclusion, integrating these suggestions into the MISIP standard could enhance its comprehensiveness and utility. By providing a more detailed framework, researchers can better interpret experimental setups and results, fostering a more robust foundation for data reuse and reproducibility in the field of stable isotope probing.
  
  Re-review: Nice work on addressing all the comments. All my concerns have been addressed.
2. GigaScience 23 Nov 2024
  
  in GigaScience
  
  DNA/RNA-stable isotope probing (SIP) is a powerful tool to link in situ microbial activity to sequencing data. Every SIP dataset captures distinct information about microbial community metabolism, kinetics, and population dynamics, offering novel insights according to diverse research questions. Data re-use maximizes the information available from the time and resource intensive SIP experimental approach. Yet, a review of publicly available SIP sequencing metadata reveals that critical information necessary for reproducibility and reuse is often missing. Here, we outline the Minimum Information for any Stable Isotope Probing Sequence (MISIP) according to the Minimum Information for any (x) Sequence (MIxS) data standard framework and include examples of MISIP reporting for common SIP approaches. Our objectives are to expand the capacity of MIxS to accommodate SIP-specific metadata and guide SIP users in metadata collection when planning and reporting an experiment. The MISIP standard requires five metadata fields: isotope, isotopolog, isotopolog label and approach, and gradient position, and recommends several fields that represent best practices in acquiring and reporting SIP sequencing data (ex. gradient density and nucleic acid amount). The standard is intended to be used in concert with other MIxS checklists to comprehensively describe the origin of sequence data, such as for marker genes (MISIP-MIMARKS) or metagenomes (MISIP-MIMS), in combination with metadata required by an environmental extension (e.g., soil). The adoption of the proposed data standard will assure the reproducibility and reuse of any sequence derived from a SIP experiment and, by extension, deepen understanding of in situ biogeochemical processes and microbial ecology.
  
  Reviewer 2. Jibing Li
  
  In this study, the authors meticulously delineated the Minimum Information about Stable Isotope Probing (MISIP) data standard within the broader framework of the Minimum Information about any (x) Sequence (MIxS) data standard. By extending the scope of MIxS to incorporate SIP-specific metadata, the authors have provided invaluable guidance to SIP practitioners regarding the collection and reporting of essential metadata for SIP experiments. Adoption of the proposed MISIP data standards is poised to significantly augment the reusability of sequence data derived from SIP experiments, thereby fostering a deeper understanding of in situ biogeochemical processes and microbial ecology. While the manuscript presents novel insights, further refinement is necessary to optimize its impact.
  
  The MISIP data standard holds paramount importance in the realm of stable isotope probe (SIP) technology as it standardizes the collection and reporting of metadata essential for SIP experiments. This significance will be elucidated in the introduction to underscore the necessity and relevance of the MISIP framework.
  
  The "Excess Atom Fraction" (EAF) serves as a pivotal metric for evaluating the isotopic enrichment of specific taxa, genomes, or genes in SIP experiments. It plays a crucial role in quantifying the incorporation of isotopically labeled substrates into microbial biomass, thereby providing valuable insights into microbial community dynamics and functional gene expression.
  
  The introduction section will be expanded to provide a comprehensive background on DNA/RNA-stable isotope probing (SIP) technology, emphasizing the need for standardized data reporting through the MISIP framework. This contextualization will elucidate the motivation behind the development of MISIP and underscore its significance in promoting data reuse and reproducibility in SIP research.
  
  To enhance transparency and credibility, a detailed account of the development process of the MISIP data standard, including the methodologies employed and potential challenges encountered, will be incorporated. This supplementary information will provide readers with insights into the rigor and practicality of the standard.
  
  Specific application cases showcasing the efficacy of the MISIP data standard in actual research scenarios will be integrated into the manuscript. These case studies will serve to illustrate the practical utility of MISIP and bolster the persuasiveness of the article.
  
  A comparative analysis of the MISIP data standard with existing similar standards will be conducted to highlight its advantages and uniqueness. This comparative approach will furnish readers with a comprehensive understanding of the distinctive features and benefits of MISIP.
  
  The article will delve into the limitations of the MISIP data standard, explore potential avenues for future improvement, and delineate its application prospects in fields such as microbial ecology. This discussion will offer critical insights into the current state and future trajectory of MISIP.
  
  The manuscript will be supplemented with a thorough examination of the limitations of MISIP data standards, potential avenues for future enhancement, and its implications for microbial ecology and other relevant fields. This holistic approach will ensure that the article comprehensively addresses all facets of the MISIP framework.
  
  Re-review: Overall, the author addressed the questions I raised; however, the existing SIP research overlooked some representative authors I consider important, such as Thomas, F. (SME J. 2019;13:1814-30) and Luo, CL (Environ Int. 2023;180:108215). The author should include a more thorough review of the relevant literature to provide a well-rounded context for the study. Additionally, I identified several formatting errors in the manuscript, such as the incorrect citation in reference 15. These errors should be rectified to meet the journal's standards.
3. GigaScience 23 Nov 2024
  
  in GigaScience
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae071). These reviews are as follows.
  
  Reviewer 1. Dayi Zhang The topic is quite interesting and important for microbiologists who are doing SIP work. However, there are some concerns about its quality and novelty. 1. 15N is widely used in SIP but the authors did not mention them in this work. As an important labelling isotope, it is not acceptable to exclude 15N work. 2. The authors have well designed and explained the catalog of MISIP, but how to standardize data from different sources are is not mentioned. In other words, there is only a method to put information together but no protocol to compare data from different studies or extract useful information from others' work. I think this is the most important expectation of this work. 3. As different protocols were used by different researchers to achieve SIP results, the authors should give criteria for their quality and the way to improve the quality for comparison. However, I cannot find such information. 4. For the reason above, I think this is only a very preliminary concept, and the datasets and methods should be further developed for practical purposes.
  
  ---Editors Comments--- This work was then rejected to allow more work and revision. and then resubmitted.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.07.13.548835v1
www.biorxiv.org www.biorxiv.org

Whole-genome re-sequencing of the Baikal seal and other phocid seals for a glimpse into their genetic diversity, demographic history, and phylogeny

2
1. GigaScience 21 Nov 2024
  
  in GigaByte
  
  Editors Assessment:
  
  Due to them being found in the landlocked, isolated habitat of Lake Baikal makes the Baikal Seal (Pusa sibirica) unique among all pinnipeds as the only freshwater seal. This paper presents reference-based assemblies of six newly sequenced Baikal seal individuals, one individual of the ringed seal, as well as the first short-read data of the harbor seal and the Caspian seal . This data aiding the study of the genomic diversity of the Baikal seal and to contribute baseline data to the limited genomic data available for seals. Peer review extended the description of the used tools and parameters in the revised manuscript, and provided some more information on the methods..This newly generated sequencing data hopefully now helps to extend the phylogeny of the Phoca/Pusa group on genome-wide data and can also broaden the view into the genetic structure and diversity of the Baikal seal
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 21 Nov 2024
  
  in GigaByte
  
  AbstractBackground: The iconic Baikal seal (Pusa sibirica), the smallest true seal, is a freshwater seal that is endemic to Lake Baikal where it became landlocked some million years ago. It is a rather abundant species of least concern, despite the limited habitat. Until recently, research on its genetic diversity has only been done on mitochondrial genes, restriction fragment analyses, and microsatellites, before its reference genome has been published. Findings: Here we report the genome sequences of six Baikal seals, and one individual of the Caspian seal, ringed seal, and the harbor seal, re-sequenced from Illumina paired-end short read data. Heterozygosity calculations of the six newly sequenced individuals are like the previously reported genomes. In addition, the novel genome data of the other species contributed to a more complete phocid seal phylogeny based on whole-genome data. Conclusions: Despite the long isolation of the land-locked Baikal seal population, the genetic diversity of this species is in the same range as other seal species. The Baikal seal appears to form a single, diverse population. However, targeted genome studies are needed to explore the genomic diversity throughout their distribution.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.142). These reviews are as follows.
  
  Reviewer 1. Yaolei Zhang
  
  Overall, the newly generated data from this study are valuable, but the authors have not effectively analyzed and interpreted the data. The entire paper appears to be more like an undergraduate bioinformatics homework exercise, with the results resembling a middle school student's description of a picture. Additionally, there are several major issues: 1. Background investigation is not sufficient: Genomic data on the Baikal seal has been publicly available five years ago, including a chromosome-level genome assemby with much higher quality, such as contig N50, which is nearly ten times higher than the reference genome used by the author in this study. 2. Methodology is unclear: The description of the software and parameters used is incomplete. A proper methodological description should allow a basic bioinformatics analyst to quickly reproduce the results of the paper. However, with the current description, there are too many missing details in the methodology section. 3. Data issues: • a. For publicly available data, the authors did not provide detailed descriptions of the accession numbers. • b. For the newly generated data in this study, the author did not sufficiently describe the data quality to support their conclusions. • c. In the supplementary table, the author show 100% mapping rates of sequencing reads for all samples. Having worked on numerous resequencing projects, I have rarely encountered 100% mapping rates, especially when aligning to different species. The author should check this. 4. Basic analytical skill/experience is lacking: For example, the PSMC analysis, sequencing depth can directly affect the results, but the author did not consider this issue and proceeded to compare curves generated from different sequencing depths directly. Additionally, how was the mutation rate (μ) derived? The generation time is only mentioned as coming from IUCN, but values are not provided in the paper. Moreover, in the genetic diversity section, is calculating heterozygosity only sufficient to be considered a measure of genetic diversity? Hope the author to read some re-sequencing papers thoroughly Re-review: The authors carefully addressed most of my concerns. Although I still doubt about the mapping rate (I did no find the mapping report attached), I am happy to accept this manuscript.
  
  Reviewer 2. Stephen Gaughran
  
  Are all data available and do they match the descriptions in the paper? Yes. NCBI numbers should be added when available.
  
  Comments: I would recommend using a lower mutation rate for seals: de novo mutation rates around 7e-9 have been measured for a few pinniped species. Line 129: I think you mean kya here (not Ma). Line 160: I think this should be "an average value of 0.066%"
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.10.19.619210v1
www.biorxiv.org www.biorxiv.org

TSTA: Thread and SIMD-Based Trapezoidal Pairwise/Multiple Sequence Alignment Method

2
1. GigaScience 21 Nov 2024
  
  in GigaByte
  
  Editors Assessment:
  
  The article presents strategies for accelerating sequence alignment using multithreading and SIMD (Single Instruction, Multiple Data) techniques, and introduces a new algorithm called TSTA (Thread and SIMD-Based Trapezoidal Pairwise/Multiple Sequence-Alignment). The Technical Release write-up presenting a detailed description of TSTA's performance in pairwise sequence alignment (PSA) and multiple sequence alignment (MSA), and compares it with various existing alignment algorithms. Demonstrating the performance gains achieved by vectorized SIMD technology and the application of threading. Testing and debugging a few errors, and adding some more background detail, demonstrating it can achieve faster comparison speed. Demonstrating TSTA's efficacy in pairwise sequence alignment and multiple sequence alignment, particularly with long reads, and showcasing considerable speed enhancements compared to existing tools.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 21 Nov 2024
  
  in GigaByte
  
  AbstractsThe rapid advancements in sequencing length necessitate the adoption of increasingly efficient sequence alignment algorithms. The Needleman-Wunsch method introduces the foundational dynamic programming (DP) matrix calculation for global alignment, which evaluates the overall alignment of sequences. However, this method is known to be highly time-consuming. The proposed TSTA algorithm leverages both vector-level and thread-level parallelism to accelerate pairwise and multiple sequence alignments.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.141). These reviews are as follows.
  
  Reviewer 1. Xingbao Song and Baoxing Song
  
  Zong et al. implemented a TSTA package that integrated the difference method, the stripe method, SIMD, and multiple threading approaches to perform efficient sequence alignments. The TSTA toolkit could conduct pairwise and multiple sequence alignments. The memory cost of TSTA is comparable with the most efficient one. Overall, TSTA is a good package, and the manuscript is well-written. While I have a few suggestions: 1) The minimap2 should be mentioned in the section on "difference recurrence relation." It has a much broader range of users and implemented an algorithm that is slightly different from the one by Suzuki, etc. 2) The striped SIMD is also implemented in reads mappers, such as BWA. 3) Page 14, line 215 "1k bps", line 227 "1000 kbps", line 230 and table1 "100k". They should be consistent. 4) In Table 4, I am not sure I understood the second and third lines correctly. Please clarify. 5) I tried to compile TSTA from the source code. To compile the package, I had to copy 'seqio.h' into the 'msa' and 'psa' folders. Please fix it.
  
  Reviewer 2. Yuansheng Liu
  
  The article explores strategies for accelerating sequence alignment using multithreading and SIMD (Single Instruction, Multiple Data) techniques, and introduces a new algorithm called TSTA. The paper provides a detailed description of TSTA's performance in pairwise sequence alignment (PSA) and multiple sequence alignment (MSA), and compares it with various existing alignment algorithms. Experimental results indicate that TSTA demonstrates significant speed advantages, particularly when handling long sequences and in the no-backtracking mode. However, the experiments on MSA are limited by the experimental environment, which does not fully address the needs of current sequencing technologies concerning long reads and depth. Specifically, the low number of sequences in MSA does not meet the requirements for downstream genomic analysis applications. While the algorithm is highly innovative, its performance on short sequences and during the backtracking phase still requires optimization. 1. In line 7, the TSTA algorithm utilizes vector-level and thread-level parallelism to accelerate pairwise and multiple sequence alignment. Why are there no experiments designed specifically to evaluate the global alignment performance of TSTA with vector-level parallelism? Or are there any other experimental designs that demonstrate the improved performance of TSTA when vector-level parallelism is employed? 2. In line 149, is the Active-F method used by the TSTA algorithm contributing to the excessive memory usage and access time overhead observed during the iterative process of PSA? Are there better optimization strategies from this perspective? If not, why does TSTA incur higher time costs in traceback as shown in Table 1? Why does bsalign result in lower time consumption? 3. Can you provide the time breakdown for each part of the parallel computation in TSTA for PSA (including at least CPU computation overhead, communication overhead, and I/O overhead) to clarify if there will be significant communication overhead issues with larger datasets and more threads? 4. Table 2 shows that both real and simulated datasets have issues with insufficient depth and short reads. In real MSA processes, it is common to encounter comparisons with depth over 60X and lengths exceeding 100 kbps for long reads. The results under the current experimental conditions seem to perform poorly for such data scenarios. Can you address this? 5. Gene data often includes repetitive regions that affect the accuracy of alignment algorithms. Can you design experiments to verify how TSTA performs in aligning long repetitive regions? Specifically, how accurately does TSTA align sequences in such regions compared to other methods? 6. Besides repetitive regions, sequencing errors produced by ONT R10 chips can also impact alignment accuracy. Alignment algorithms used in genome correction often struggle to detect such errors. How does TSTA handle such issues during MSA? Can the algorithm be designed to address these sequencing errors more effectively? Re-review: After thoroughly reviewing the revised manuscript and testing the TSTA tool, I cannot endorse the manuscript for publication in its current form. I encourage the authors to address the following issues thoroughly and consider re-submitting after significant improvements. Efficiency Concerns: In the context of multiple sequence alignment (MSA), I find that TSTA does not demonstrate a significant advantage in terms of efficiency. I conducted a test with approximately 2G of homologous diploid reads (not too large data), and the tool has been running for around 29 hours. Despite this extensive runtime, the process remains incomplete. This is far from the efficiency one would expect from a tool designed for large-scale sequence alignment. Functionality Issues: There are still unresolved issues with the tool's functionality. The -f parameter does not appear to work as intended, and there are also problems with the -o parameter. Such issues need to be addressed to ensure the tool's reliability and usability.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.09.18.613655v1
www.biorxiv.org www.biorxiv.org

Chromosome-level genome assembly and annotation of the crested gecko, Correlophus ciliatus, a lizard incapable of tail regeneration

2
1. GigaScience 21 Nov 2024
  
  in GigaByte
  
  Editors Assessment:
  
  The crested gecko (Correlophus ciliatus), is a lizard species endemic to New Caledonia, and a potentially interesting model organism due to its unusual (for a gecko) inability to regenerate amputated tails. With that in mind here is presented a new reference genome for the species, assembled using PacBio Sequel II platform and Dovetail Omni-C libraries. Producing a genome with a total size of 1.65 Gb, 152 scaffolds, a L50 of 6, and N50 of 109 Mb. Peer review making sure more detail was added on data acquisition and processing to enhance reproducibility. In the end producing potentially useful data for studying the genetic mechanisms involved in loss of tail regeneration.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 21 Nov 2024
  
  in GigaByte
  
  AbstractThe vast majority of gecko species are capable of tail regeneration, but singular geckos of Correlophus, Uroplatus, and Nephrurus genera are unable to regrow lost tails. Of these non-regenerative geckos, the crested gecko (Correlophus ciliatus) is distinguished by ready availability, ease of care, high productivity, and hybridization potential. These features make C. ciliatus particularly suited as a model for studying the genetic, molecular, and cellular mechanisms underlying loss of tail regeneration capabilities. We report a contiguous genome of C. ciliatus with a total size of 1.65 Gb, a total of 152 scaffolds, L50 of 6, and N50 of 109 Mb. Repetitive content consists of 40.41% of the genome, and a total of 30,780 genes were annotated. Assembly of the crested gecko genome provides a valuable resource for future comparative genomic studies between non-regenerative and regenerative geckos and other squamate reptiles.Findings We report genome sequencing, assembly, and annotation for the crested gecko, Correlophus ciliatus.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.140). These reviews are as follows.
  
  Reviewer 1. Anthony Geneva and Cleo Falvey
  
  In their revised manuscript Gumangan and colleagues have addressed each of the comments we made on the original manuscript via substantial revisions. We appreciate the improvements the authors have made but feel there are a few remaining issues that require attention, detailed below. Those issues notwithstanding, this new assembly and annotation represent a very nice contribution to the field and will certainly be widely used.
  
  Specific comments: Pages 2 and 6: Each time L50 or L90 statistics are reported they are listed with the units “bp”. These values are counts of scaffolds are are typically simply reported as integers without units. Page 3. “Furthermore, C. ciliatus is the only non-regenerative lizard species capable of hybridizing with regenerative relatives, specifically C. sarasinorum, Mniarogekko chahoua, and Rhacodactylus auriculatus.” This statement is very interesting but requires a reference or at least attribution of some kind (eg - personal observation by one of the co-authors). Page 3: “Genomic DNA was sequenced using the Illumina Novaseq 6000 platform. 185.8 gigabase-pairs of PacBio CCS reads were used as inputs to Hifiasm v0.15.4-r347 [8] with default parameters.” The sequencer listed here for generating long reads seems to be an error and should be some PacBio platform (Sequel, Sequel IIe, etc). Page 6: “The contig/scaffold N50 is 109 Mb, and the largest scaffold had a length 1169 Mbp (Table 1)”. 1169 should be 169.
  
  Reviewer 2. Zexian Zhu
  
  Review comments are in the following link: https://gigabyte-review.rivervalleytechnologies.comdownload-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvNTU3L3Jldmlldy5kb2N4
  
  Reviewer 3. Chaochao Yan
  
  Are all data available and do they match the descriptions in the paper? No. In the section "Availability of supporting data," it is stated that "supporting datasets, including annotation, are available at GigaDB." However, I was unable to locate these datasets during my search. Could you please provide a direct link or the accession number to access these resources?
  
  Is the data acquisition clear, complete and methodologically sound? No. The manuscript currently lacks detailed information regarding the samples and data used to assemble and annotate the reference genome. For instance, it does not specify how many samples or libraries were used for RNA-Seq or whole-genome sequencing. I suggest including a table that provides comprehensive information on the samples and sequencing data. Additionally, while the manuscript mentions that "Genomic DNA was sequenced using the Illumina Novaseq 6000 platform," the corresponding Illumina data are not described. I am unclear about how the PacBio CCS reads were produced. Could you please provide more details or clarify the methodology used to generate these reads?
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. Some methods described in the manuscript lack sufficient detail, particularly for tools such as BLAST, BlobTools, HiRise, and BWA. To ensure reproducibility, I recommend providing the specific parameters used for these analyses.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.09.28.615630v1
www.biorxiv.org www.biorxiv.org

SMARTER-database: a tool to integrate SNP array datasets for sheep and goat breeds

2
1. GigaScience 21 Nov 2024
  
  in GigaByte
  
  Editors Assessment:
  
  This paper presents the SMARTER database, a collection of tools and scripts to gather, standardize, and share with the scientific community a comprehensive dataset of genomic data and metadata information on worldwide small ruminant populations. Which has come out of the EU multi-actor (12 country) H2020 project called SMARTER: SMAll RuminanTs breeding for Efficiency and Resilience. This bringing together genotypes for about 12,000 sheep and 6,000 goats, alongside phenotypic and geographic information. The paper providing insight into how the database was put together, presenting the code for the SMARTER—frontend, backend and API, alongside instructions for users. Peer review tested the platform and provided suggestions on improving the metadata. Demonstrating the project provides valuable information on sheep and goat populations around the world, that can be an essential tool for ruminant researchers. Enabling them to generate new insights and offer the possibility to store new genotypes and drive progress in the field.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 21 Nov 2024
  
  in GigaByte
  
  AbstractUnderutilized sheep and goat breeds have the ability to adapt to challenging environments due to their genetic composition. Integrating publicly available genomic datasets with new data will facilitate genetic diversity analyses; however, this process is complicated by important data discrepancies, such as outdated assembly versions or different data formats. Here we present the SMARTER-database, a collection of tools and scripts to standardize genomic data and metadata mainly from SNP chips arrays on global small ruminant populations with a focus on reproducibility. SMARTER-database harmonizes genotypes for about 12,000 sheep and 6,000 goats to a uniform coding and assembly version. Users can access the genotype data via FTP and interact with the metadata through a web interface or programmatically using their custom scripts, enabling efficient filtering and selection of samples. These tools will empower researchers to focus on the crucial aspects of adaptation and contribute to livestock sustainability, leveraging the rich dataset provided by the SMARTER-database.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.139). These reviews are as follows.
  
  Reviewer 1. Ran Li
  
  The authors presented an online SMARTER-database, which collected a large number of genotype data for sheep and goats. The resources are of great importance for the community.
  
  My major concerns: 1) The below link is not accessible: webserver.ibba.cnr.it 2) For sheep, the database support reference genome assembly of Oar3 and Oar4, but actually Oar 3 is rarely used. Instead, the current ovine reference genome assembly (ARS-UI_Ramb_v3.0) is not supported. 3) For the presentation of metadata (https://webserver.ibba.cnr.it/smarter/breeds?species=Sheep), I suggest additional columns indicating the region and country should be provided. 4) For the datasets (https://webserver.ibba.cnr.it/smarter/datasets), references are needed to know where the data are from.
  
  Re-review:
  
  My comments have been properly addressed. The manuscript is acceptable for publication.
  
  Reviewer 2. Hans Lenstra and Johannes A. Lenstra
  
  Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes. This is implicitly clear and does not need to elaborate upon.
  
  As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code? No. This does not to seem necessary.
  
  Is the code executable? unable_to_test Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? unable_to_test Is the documentation provided clear and user friendly? Yes. I did not test this.
  
  Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level? No. I did not see such a list, but I would not be able to assess this.
  
  Have any claims of performance been sufficiently tested and compared to other commonly-used packages? not_applicable
  
  Is automated testing used or are there manual steps described so that the functionality of the software can be verified? No. I did not find any of this but it does not seem to be essential.
  
  Additional Comments: This manuscript describes a highly useful database of sheep and goat genome-wide SNP genotypes from several sources, supplemented with phenotypes and geographic locations. I recommend this manuscript for publication in Gigascience after a revision. There is some missing information, whereas the presentation should become less cryptic to readers who are less familiar with the bioinformatic terminology. Missing info. 1. The title and abstract do not mention that SMARTER focuses on SNPs that are genotyped on bead arrays or related technologies. The focus on the genome-wide (GW) SNP genotypes, which only partially represents the total genomic diversity, should already be clear from the Title and the Abstract. 2. Nowadays there are more publications on WGS data, T2T sequences and pangenomes than on GW SNP genotypes, so people may wonder if the GW SNP genotypes still are useful. It may be emphasized that bead-arrays allow an affordable analysis of many animals and that genotypes derived from WGS data contain many false homozygote scores if not sequenced at a very high coverage. 3. Figures 2 and 3 give an idea of the geographic coverage, but what is the unit of the numbers that are visualized in the heat map (0 to 2300 for sheep, 0 to 1100 for goats)? 4. It is not clear which published data have been used or not. We recommend presenting a supplemental table describing the current contents: country, breed, number of animals, number of SNPs (at least 50K or HD), reference. 5. Is there an organized effort to update the database, which ideally should contain all published GW SNP databases? 6. To my experience for most HW SNP datasets only the filtered data after quality control (typically 45 to 49K, less than 42K if sheep 50K and HD genotypes are combined) are available. How is this handled? 7. It may be mentioned that after omission of A/T and G/C SNPs the TOP strand consists only of A/C and A/G SNPs. 8. The problematic SNPs are mentioned twice within the last paragraph of the section Data Composition. 9. Does SMARTER allow to store phased datasets and show the variant haplotypes? These can now be generated by long-read sequencing and are needed for several downstream analysis options. 10. Table 1: OAR3 = Oar_v3.1 and OAR4 = Oar_v4.0? Please use the official codes. 11. Are there options to convert the data to newer assemblies? For instance, the sheep ARS-UI_Ramb_v3.0 is superior to Oar_v4.0. I have used an NCBI tool for conversion of Oar_v1.0 (most popular for 50K datasets) and Oar_3.1 (used often for sheep HD datasets) to Oar_v4.0, but this tool has probably been discontinued and was not available for goat assemblies. 12. I repeatedly found that most published or unpublished databases contain several errors such as duplicates and outliers by mislabeling or crossbreeding. Because these are better removed prior to downstream analysis, data curation would be desirable, for instance by an inspection of a NJ tree of individuals. This also shows the degree of breed-level differentiation, for instance the relationships of different populations of a transboundary breed. These caveats should at least be mentioned. 13. Another caveat: is there a systematic check on the validity of the merging of datasets by testing if breeds sampled independently by different institutes cluster closely together? Presentation. 14. Abbreviations should not be used in abstract. What is “REST API”? These abbreviations of course are in the list, but what is “Representational State Transfer”? And “JSON Web Token”? 15. Figure 1 needs more guidance via the legend. The boxes show alternative formats? What are “str”, “dict “? 16. Figure 5 is useful and seems to retrieve data for the goat Alpine and Bionda dell'Adamello breeds. It would also be useful to show other “API-URL” (this is user input?) while describing in plain language what is being accomplished. 17. Figure 6: bold indicates the user input? What is exactly a “array [string]” (give an example)? A few other examples may be most instructive and familiarize the reader with the logic of SMARTER. 18. In the section “The SMARTER-database project”: what is a mongoengine? 19. In the same section: “Finally the VariantSpecie abstract class is inherited by . . .”: this sentence is difficult to understand. 20. In the section Reproducibility: please give a short description of what is the use of the Conda and Docker programs. 21. Same section: “Raw data undergoes initial exploration”, “structure and potential issues”: can you be more specific? The last part of this section is also difficult to follow.
  
  Re-review: This paper presents the SMARTER database, a collection of tools and scripts to gather, standardize, and share with the scientific community a comprehensive dataset of genomic data and metadata information on worldwide small ruminant populations. Which has come out of the EU multi-actor (12 country) H2020 project called SMARTER: SMAll RuminanTs breeding for Efficiency and Resilience. This bringing together genotypes for about 12,000 sheep and 6,000 goats, alongside phenotypic and geographic information. The paper providing insight into how the database was put together, presenting the code for the SMARTER—frontend, backend and API, alongside instructions for users. Peer review tested the platform and provided suggestions on improving the metadata. Demonstrating the project provides valuable information on sheep and goat populations around the world, that can be an essential tool for ruminant researchers. Enabling them to generate new insights and offer the possibility to store new genotypes and drive progress in the field.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.09.01.610681v2
www.biorxiv.org www.biorxiv.org

NucBalancer: Streamlining Barcode Sequence Selection for Optimal Sample Pooling for Sequencing

2
1. GigaScience 21 Nov 2024
  
  in GigaByte
  
  Editors Assessment:
  
  This paper presents NucBalancer, a R-pipeline and Shiny app designed for the optimal selection of barcode sequences for sample multiplexing in sequencing. Providing a user-friendly interface aiming to make this process accessible to both bioinformaticians and experimental researchers, enhancing its utility in adapting libraries prepared for one sequencing platform to be compatible with others. Important now with the introduction of additional sequencing platforms by Element Biosciences (AVITI System) and Ultima Genomics (UG100) increasing the diversity and capability of genomic research tools available. NucBalancer’s incorporation of dynamic parameters, including customizable red flag thresholds, allows for precise and practical barcode sequencing strategies. This adaptability is key in ensuring uniform nucleotide distribution, particularly in MGI sequencing and single-cell genomic studies, leading to more reliable and cost-effective sequencing outcomes across various experimental conditions. All the code is available under an open source license, and upon review the authors have also shared the code for the Shiny app.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 21 Nov 2024
  
  in GigaByte
  
  AbstractRecent advancements in next-generation sequencing (NGS) technologies have brought to the forefront the necessity for versatile, cost-effective tools capable of adapting to a rapidly evolving landscape. The emergence of numerous new sequencing platforms, each with unique sample preparation and sequencing requirements, underscores the importance of efficient barcode balancing for successful pooling and accurate demultiplexing of samples. Recently launched new sequencing systems claim better affordability comparable to more established platforms further exemplifies these challenges, especially when libraries originally prepared for one platform need conversion to another. In response to this dynamic environment, we introduce NucBalancer, a Shiny app developed for the optimal selection of barcode sequences. While initially tailored to meet the nucleotide, composition challenges specific to G400 and T7 series sequencers, NucBalancer’s utility significantly broadens to accommodate the varied demands of these new sequencing technologies. Its application is particularly crucial in single-cell genomics, enabling the adaptation of libraries, such as those prepared for 10x technology, to various sequencers including G400 and T7 series sequencers. By facilitating the efficient balancing of nucleotide composition and the accommodation of differing sample concentrations, NucBalancer plays a pivotal role in reducing biases in nucleotide composition. This enhances the fidelity and reliability of NGS data across multiple platforms. As the NGS field continues to expand with the introduction of new sequencing technologies, the adaptability and wide-ranging applicability of NucBalancer render it an invaluable asset in genomic research. This tool addresses the current sequencing challenges ensuring that researchers can effectively balance barcodes for sample pooling regardless of the sequencing platform used.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.138). These reviews are as follows.
  
  Reviewer 1. Aamir Khan
  
  Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?
  
  Yes. The tool has novel features not reported in previous tools for barcoding.
  
  Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code?
  
  Yes. The tool is available as an R script as well as a shiny app.
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Yes. I would suggest mentioning a few features that are novel or superior to other tools. Perhaps adding a table specifying these novel features that are not part of existing tools will add value to MS.
  
  Is the documentation provided clear and user friendly?
  
  Yes. The documentation is provided in a clear and user-friendly way. The input file formats are given in the GitHub page. It would be better to add an example to the shiny app page.
  
  Yes. Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level? Yes. Dependencies are mentioned on the tool documentation page and can be installed if R is already installed.
  
  Additional Comments: The authors have a well-written MS describing the NucBalancer tool. The tool adds value for sequencing by pooling samples and will be useful as we make technological advancements in the sequencing space.
  
  Reviewer 2. Hugo Varet
  
  Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?
  
  Yes. The manuscript explains the constraints to be satisfied when looking for barcodes but more details about the context (Illumina chemistry for instance) would be appreciated. Moreover, is the software compatible with dual-indexing?
  
  Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code?
  
  Yes. The source code of the program is available on GitHub as a R script, but the source code of the Shiny application is not available.
  
  As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?
  
  Yes. Support can be asked by email to the authors as stated at the end of the README on GitHub.
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?
  
  Yes. The example command line works well. However, the R script needs shiny and xtable packages to be loaded even if none of their functions is actually called in the script.
  
  Is the documentation provided clear and user friendly?
  
  No. A detailed documentation would improve the application proposed. In particular, more details about the different chemistries used by Illumina, MGI... and the constraints to find compatible barcodes.
  
  Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?
  
  No. The strategy used to find barcodes seems very simple, but more details would improve the manuscript.
  
  Have any claims of performance been sufficiently tested and compared to other commonly-used packages?
  
  No. The manuscript cites several packages developed to find compatibles sequencing barcodes but the performances are not compared. Moreover, we do not know if NucBalancer still work with a high number of samples/barcodes.
  
  Are there (ideally real world) examples demonstrating use of the software?
  
  No. A real world example would be appreciated to illustrate the software, especially in a scenario where the other cited solutions were not able to find compatible barcodes.
  
  Is automated testing used or are there manual steps described so that the functionality of the software can be verified?
  
  No.
  
  Additional Comments: I would suggest the authors to improve the design of the Shiny app as (at the moment) it only runs a R script and prints the result. Moreover, I think the quality of the R code could be easily improved (e.g. loops with strange counters or comparisons with booleans).
  
  Re-review: I thank the authors for the improvements they made on this new version of the manuscript. At this stage, I'm not totally satisfied for the following reasons: - authors tell the source code of the Shiny app is now available on GitHub, but I have not been able to find it. - in the manuscript, the sentence "The tool does not have any dependency other than the utilities from the base R package" is no longer true as the tool now uses optparse. - in table 1, checkMyIndex is referenced with no web interface available white it actually exists (https://checkmyindex.pasteur.fr/). Moreover, the proposed web interface could still be improved. For instance: - it would be great to add something to show the algorithm is currently looking for a solution. - check the input files have a valid structure to be used. - display the input files when they are loaded to make sure the user uploaded the correct file.
  
  Reviewer 3. Wen Yao
  
  The authors reported a new tool for barcode sequences design. This tool is developed using R/Shiny and is available for using online. Below are my comments for further improvement of the manuscript and the tool. 1. Please provide a “load example data” button in the Shiny app. With this button, the example data can be easily loaded by the users for testing NucBalancer. 2. This URL (http://146.118.68.98:8888/) for NucBalancer should also be given in the manuscript. 3. The “Download Table” button is not working. 4. Format of the input data should be checked, as input data in wrong format caused the NucBalancer to crash. 5. The authors should compare NucBalancer with published similar tools in this field. More details are required.
  
  Re-review: The authors have addressed all my concerns.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.09.06.611747v1
www.biorxiv.org www.biorxiv.org

V-pipe 3.0: a sustainable pipeline for within-sample viral genetic diversity estimation

2
1. GigaScience 10 Nov 2024
  
  in GigaScience
  
  The large amount and diversity of viral genomic datasets generated by next-generation sequencing technologies poses a set of challenges for computational data analysis workflows, including rigorous quality control, adaptation to higher sample coverage, and tailored steps for specific applications. Here, we present V-pipe 3.0, a computational pipeline designed for analyzing next-generation sequencing data of short viral genomes. It is developed to enable reproducible, scalable, adaptable, and transparent inference of genetic diversity of viral samples. By presenting two large-scale data analysis projects, we demonstrate the effectiveness of V-pipe 3.0 in supporting sustainable viral genomic data science.Competing Interest StatementThe authors have declared no competing interest.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giae065), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license.
  
  Reviewer: Shilpa Garg
  
  V-pipe 3.0 is introduced as an advanced computational pipeline tailored for the analysis of nextgeneration sequencing data from short viral genomes. Designed to meet the challenges posed by the vast and diverse datasets generated by these technologies, V-pipe 3.0 emphasizes reproducibility, scalability, adaptability, and transparency. It achieves this by adhering to Snakemake's best practices, allowing easy swapping of virus-specific configuration files, and providing thoroughly tested examples online.
  
  The utility of V-pipe 3.0 is showcased through its application in two extensive data analysis projects, proving its efficacy in sustainable viral genomic data science. Central to V-pipe 3.0 is its capacity for estimating viral diversity from sequencing data. A versatile benchmarking module has been developed to continuously assess various diversity estimation methods, accommodating the rapid advancements within this field. The pipeline simplifies the inclusion of new tools and datasets, supporting both synthetic and real experimental data. However, challenges in global haplotype reconstruction highlight the need for scalable methods that can accurately reflect the complex population structures of viruses and manage the uncertainties in the results.
  
  Some additional clarification in the manuscript would be appreciated. 1) I'm curious about how the efficiency is attained. 2) Is it possible to utilize V-pipe for analyzing other microorganisms? 3) The authors might consider directing readers to the following review article for reference: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02328-9 4) Identifying specific genes or genome regions with high polymorphism across different populations would be fascinating. How does V-pipe handle analysis in these highly variable regions?
2. GigaScience 10 Nov 2024
  
  in GigaScience
  
  AbstractThe large amount and diversity of viral genomic datasets generated by next-generation sequencing technologies poses a set of challenges for computational data analysis workflows, including rigorous quality control, adaptation to higher sample coverage, and tailored steps for specific applications. Here, we present V-pipe 3.0, a computational pipeline designed for analyzing next-generation sequencing data of short viral genomes. It is developed to enable reproducible, scalable, adaptable, and transparent inference of genetic diversity of viral samples. By presenting two large-scale data analysis projects, we demonstrate the effectiveness of V-pipe 3.0 in supporting sustainable viral genomic data science.Competing Interest StatementThe authors have declared no competing interest.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giae065), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license.
  
  Reviewer: Fotis Psomopoulos
  
  The manuscript showcases a computational pipeline designed for analyzing next generation sequencing data of short viral genomes, namely V-pipe 3.0. After an overview of the challenge the tool is addressing, i.e. the necessity of continuous benchmarking of various methods due to their diverse performance across different scenarios,the paper continues with a detailed listing of the results, highlighting the key elements of Reproducibility, Scalability, Adaptability and Transparency.
  
  The next section provides some details on the three applications / demonstrations of V-Pipe 3.0, i.e the Swiss SARS-CoV-2 Sequencing Consortium, the Swiss surveillance of SARS-CoV-2 genomic variants in wastewater and the Global haplotype reconstruction benchmark. This is followed by a comprehensive comparison of V-Pipe 3.0 Ï„Î¿ other relevant viral bioinformatics pipelines for within sample diversity estimation, focusing on functionalities and sustainability, and specifically nf-core/viralrecon, HAPHPIPE and ViralFlow, as well as a section discussing the main advantages of V-Pipe 3.0 as well as the rationale for some of the identified drawbacks.
  
  The paper concludes with a thorough description of the underlying methods of V-Pipe 3.0 as well as on the data used. Overall the paper gives a very good presentation of V-Pipe, and makes a strong case about its use and value in a real-world challenge. An overall comment is that there is some confusion on the role of V-Pipe 3.0 as a workflow - i.e. whether it's a dynamic system that uses different tools per step based on user input, or if it's an automated systems that benchmarks the analysis using (e.g.) synthetic data as the baseline. In either case, there are also a few unclear points in the manuscript itself that could be further improved.
  
  Specifically: -- It is not clear how V-pipe 3.0 differs from V-pipe. Although there is an indication of significant differences, an overview of the new features implemented in this version and/or a small introductory paragraph would be useful. -- In the "Results" section, lines 130 - 225 appear to refer to the implemented methodology and might be better served as part of the "Methods" section -- In the "Results" section, lines 135 - 138 implied that GitHub Actions are used to ensure Reproducibility of the workflow. Some more elaboration on this would be very useful, as GitHub actions are commonly used to automate processes (such as testing, conflict resolution etc). In particular, an reproducibility issue that might not be resolvable by GitHub actions are dependency conflicts that are specific to the particular system that is being tested. -- In the "Results" section, lines 139 - 146, it's not clear how the benchmark study contributes to the overall reproducibility of V-pipe 3.0. Some further explanation of the rationale would be very useful here. -- In the "Results" section, lines 179 - 183, it is not clear how Git and GitHub ensure adaptability of any new features that are implemented. Usually a version control system/automation system, can facilitate the integration of new features, but it's not readily evident how it supports/ensures/facilitates adaptability. Maybe a definition of "adaptability" in this particular context could also help. -- In the "Applications" section, it is not clear which version of V-Pipe was used for the overall analysis (V-pipe or V-pipe 3.0), especially in the wastewater use case. -- In the section "Comparison to other workflows" it is not very clear which tools are implemented within V-pipe 3.0, which differences there are with previous version (V-pipe) and how these differ to other pipelines. A table that is summarizing these details and highlighting the differences would be very useful here. Moreover, there are a few minor points that would enhance the readers' understanding: -- (minor) In the Section "2.1 Reproducibility", it's mentioned that all software dependencies are defined in Conda environments, making V-pipe 3.0 portable between different computing platforms. Is there a particular reason why V-Pipe itself isn't implemented as a conda package directly? -- (minor) More often that not, the pandemic is named as COVID19, in contrast to the virus that is named "SARS-CoV-2". It may be useful to amend/update the references to the "SARS-CoV-2 pandemic" accordingly.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.10.16.562462v1
Oct 2024
pmc.ncbi.nlm.nih.gov pmc.ncbi.nlm.nih.gov

CannSeek? Yes we Can! An open-source single nucleotide polymorphism database and analysis portal for Cannabis sativa

1
1. GigaScience 30 Oct 2024
  
  in Public
  
  Preprint submitted to: https://doi.org/10.25918/preprint.367
  
  A webinar with the authors is also available in Cassyni https://doi.org/10.52843/cassyni.y1p61f
Visit annotations in context

Annotators

GigaScience

URL

pmc.ncbi.nlm.nih.gov/articles/PMC11480739/
pmc.ncbi.nlm.nih.gov pmc.ncbi.nlm.nih.gov

Building a community-driven bioinformatics platform to facilitate Cannabis sativa multi-omics research

1
1. GigaScience 30 Oct 2024
  
  in Public
  
  Preprint submitted to: https://doi.org/10.1101/2024.10.02.616368
  
  A webinar with the authors is also available in Cassyni https://doi.org/10.52843/cassyni.y1p61f
Visit annotations in context

Annotators

GigaScience

URL

pmc.ncbi.nlm.nih.gov/articles/PMC11515022/
www.biorxiv.org www.biorxiv.org

Building a community-driven bioinformatics platform to facilitate Cannabis sativa multi-omics research

2
1. GigaScience 20 Oct 2024
  
  in GigaByte
  
  Editors Assessment:
  
  This paper reports the establishment of the International Cannabis Genomics Research Consortium (ICGRC) web portal leveraging the open source Tripal platform to enhance data accessibility and integration for Cannabis sativa (Cannabis) multi-omics research. With the aim of bringing together the wealth of publicly available genomic, transcriptomic, proteomic, and metabolomic data sets to improve cannabis for food, fiber and medicinal traits. Tripal is a content management system for genomics data, presenting a ready-to-use specialized ‘omics modules for loading, visualization, and analysis, and is GMOD (Generic Model Organism Database) standards-compliant. The paper explaining how this was put together, what data and features are available, and providing a case study for other communities wanting to create their own Tripal platform. Covering their setup and customizations of the Tripal platform, and how they re-engineered modules for multi-omics data integration, and addition of many other custom features that can be reused. Peer review fixed a few minor bugs and added clarifications on how the platform will be updated.
  
  *This evaluation refers to version 1 of the preprint *
  
  Summary
2. GigaScience 20 Oct 2024
  
  in GigaByte
  
  AbstractGlobal changes in Cannabis legislation after decades of stringent regulation, and heightened demand for its industrial and medicinal applications have spurred recent genetic and genomics research. An international research community emerged and identified the need for a web portal to host Cannabis-specific datasets that seamlessly integrates multiple data sources and serves omics-type analyses, fostering information sharing.The Tripal platform was used to host public genome assemblies, gene annotations, QTL and genetic maps, gene and protein expression, metabolic profile and their sample attributes. SNPs were called using public resequencing datasets on three genomes. Additional applications, such as SNP-Seek and MapManJS, were embedded into Tripal. A multi-omics data integration web-service API, developed on top of existing Tripal modules, returns generic tables of sample, property, and values. Use-cases demonstrate the API’s utility for various -omics analyses, enabling researchers to perform multi- omics analyses efficiently.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.137). These reviews are as follows.
  
  Reviewer 1. Weiwen Wang
  
  Is the code executable?
  
  Unable to test.
  
  This manuscript is about an online platform, and I am not sure how to test the code.
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?
  
  Same as above.
  
  Additional Comments:
  
  With the increasing legalization of cannabis in many countries today, exploring this crop has become a hot topic of research. This manuscript by Mansueto et al. introduces a platform built on the Tripal framework, designed to facilitate multi-omics research in Cannabis sativa. The platform integrates genomic, transcriptomic, proteomic, and metabolomic data, providing researchers with a comprehensive resource for data analysis and sharing. Additionally, APIs have been developed, enabling rapid querying. This manuscript detailed information on how to customize Tripal modules and Chado schema for managing biological entities. Finally, this manuscript highlights the importance of standardization in data storage and analysis, proposing community-wide adoption of standardized nomenclature to ensure consistency and traceability of data. Overall, the platform is poised to become a valuable resource for cannabis research and to advance scientific progress in related fields.
  
  While this manuscript was engaging, particularly in the sections on Tripal "re-engineering" and controlled vocabulary, I do have several concerns.
  
  1 Because my registration (using business email) has not been approved, I cannot test the functions requiring ICGRC registration.
  
  2 The authors noted that the Cannabis Genome Browser has not been updated. Do the authors have a plan for updating ICGRC? If so, what is the proposed update frequency?
  
  3 ICGRC currently includes only a few cannabis cultivars, especially when compared to other platforms like Kannapedia and CannabisGDB. Do the authors have plans to add additional cultivars, such as First Light and Jamaican Lions mentioned in this manuscript, in the near future?
  
  4 When I tried to register using Gmail, an error popped up: ‘Domain is not allowed to register for this site’. Perhaps it would be clearer to instruct users to use a business email for registration directly.
  
  5 There is a data submission function in ICGRC, but the exact workings of this feature remain unclear to me. If a user submits a cannabis genome to the ICGRC, whether this data will be visualized within specific modules like synteny search or genetic mapping tools on the platform.
  
  Reviewer 2. Hongyun Shang
  
  Is the code executable?
  
  Unable to test.
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?
  
  Unable to test.
  
  This is a comprehensive database with many features that improves the shortcomings of cannabis species that had no genome database in the past. It is a good work. Here are some minor suggestions:
  
  Did not find the function of searching gene and protein sequences directly by gene id without providing chromosome location, which may be a common feature of many omics databases.
  
  In the chapter "The need for cannabis multi-omics databases and analysis platforms", "There are no analysis tools or results available on this website", "No results available" seems inappropriate.
  
  In the chapter "Cannabis - Omics, Genetic and Phenotypic Datasets in the Public Domain", "Crop Ontology" Crop Ontology, is "Crop Ontology" repeated?
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.10.02.616368v1
www.biorxiv.org www.biorxiv.org

PhysiMeSS - A New PhysiCell Addon for Extracellular Matrix Modelling

2
1. GigaScience 19 Oct 2024
  
  in GigaByte
  
  Editors Assessment:
  
  PhysiCell is an open source multicellular systems simulator for studying many interacting cells in dynamic tissue microenvironments. As part of the PhysiCell ecosystem of tools and modules this paper presents a PhysiCell addon, PhysiMeSS (MicroEnvironment Structures Simulation) which allows the user to accurately represent the extracellular matrix (ECM) as a network of fibres. This can specify rod-shaped microenvironment elements such as the matrix fibres (e.g. collagen) of the ECM, allowing the PhysiCell user the ability to investigate physical interactions with cells and other fibres. Reviewers asked for additional clarification on a number of features. And the paper now clear future releases will provide full 3D compatibility and include working on fibrogenesis, i.e. the creation of new ECM fibres by cells.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 19 Oct 2024
  
  in GigaByte
  
  AbstractThe extracellular matrix is a complex assembly of macro-molecules, such as collagen fibres, which provides structural support for surrounding cells. In the context of cancer metastasis, it represents a barrier for the cells, that the migrating cells needs to degrade in order to leave the primary tumor and invade further tissues. Agent-based frameworks, such as PhysiCell, are often use to represent the spatial dynamics of tumor evolution. However, typically they only implement cells as agents, which are represented by either a circle (2D) or a sphere (3D). In order to accurately represent the extracellular matrix as a network of fibres, we require a new type of agent represented by a segment (2D) or a cylinder (3D).In this article, we present PhysiMeSS, an addon of PhysiCell, which introduces a new type of agent to describe fibres, and their physical interactions with cells and other fibres. PhysiMeSS implementation is publicly available at https://github.com/PhysiMeSS/PhysiMeSS, as well as in the official Physi-Cell repository. We also provide simple examples to describe the extended possibilities of this new framework. We hope that this tool will serve to tackle important biological questions such as diseases linked to dis-regulation of the extracellular matrix, or the processes leading to cancer metastasis.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.136), and has published the reviews under the same license. It is also part of GigaByte’s PhysiCell Ecosystem series for tools that utilise or build upon the PhysiCell platform: https://doi.org/10.46471/GIGABYTE_SERIES_0003 These reviews are as follows.
  
  Reviewer 1. Erika Tsingos
  
  One important aspect that the authors need to be aware of and mention explicitly is that their algorithm for fiber set-up leads to differences in fiber concentration and orientation at the boundary, because fibers that are not wholly contained in the simulation box are discarded. The effect of this choice can be seen upon close inspection of Figure 2: In the left panel, fibers align tangentially to the boundary, so locally the orientation is not isotropic. Similarly, in Figure 2 middle and right panels, the left and right boundaries have lower local fiber concentration. This issue could potentially affect the outcome of a simulation, so it's important that readers are made aware so that if necessary they can address this with a modified algorithm. ----- Minor comments: In the abstract, the phrasing implies agent-based frameworks are only used for tumour evolution. I would rephrase such that it is clear that tumour evolution is one example among many possible applications. I suggest adding a dash to improve readability in the following sentence in the introduction: "However, we note that the applications of PhysiMeSS stretch beyond those wanting to model the ECM -- as the new cylindrical/rod-shaped agents could be used to model blood vessel segments or indeed create obstacles within the domain." In the implementation section, add a short sentence to clarify if PhysiMeSS is "backwards compatible" with older PhysiCell models that do not use the fiber agent. Notation in equations: A single vertical line is absolute value, and two vertical lines is Euclidean norm? The explanation of Equation 1 implies that the threshold v_{max} should limit the parallel force, but the text does not explicitly say if ||v|| is restricted to be less or equal to v_{max}. Is that the case? In Equation 2, I don't see the need to square the terms in parenthesis. If |v*l_f| is an absolute value it is always positive. Since l_f is normalized the value of the dot product is only between 0 and the magnitude of v. Am I missing something? Are p_x and p_y in the moment arm magnitude coordinates with respect to the fiber center? Table 2: It would be helpful to have a separate column with the corresponding symbols used throughout the text and equations. Figure 5/6: Missing crosslinker color legend. ----- Typos/grammar: "As an aside, an not surprisingly," --> As an aside, and not surprisingly, "This may either be because as a cell tries to migrate through the domain fibres which act as obstacles in the cell’s path," --> remove the word "which"
  
  Reviewer 2. Jinseok Park
  
  Noel et al. introduce PhysiMess - a new PhysiCell Addon for ECM remodeling. This new addon is a powerful tool to simulate ECM remodeling and has the potential to be applied to mechanobiology research, which makes my enthusiasm high. I would like to give a few suggestions. 1) Basically, it is an addon of PhysiCell. So, I suggest describing PhysiCell and how to add the addon for readers who are not familiar with these tools. Also, screen captures of tool manipulation would be very helpful. 2) Figure 2 and 3 exhibit the outcome of the addon showing ECM remodeling. I would suggest to show actual ECM images modeled by the addon. 3) The equations reflect four interactions, and in my understanding, the authors describe cell-fibre, fiber-cell, and fiber-fiber interactions. I suggest generating an example corresponding to each interaction's modulation and explaining how the add-on results explain the physiological phenomena. For instance, focal adhesion may be a key modulator of cell-fibre or fiber-cell interaction, presumably, alpha or beta fiber. I would demonstrate how the different parameters generate different results and explain the physiological situation modeled by the results. 4) Similarly, Figure 5 and Figure 6 only show one example and no comparison with other conditions. For example, It would be better to exhibit no pressure/pressure conditions. It may help readers estimate how the pressure impacts cell proliferation.
  
  Reviewer 3. Simon Syga
  
  The presented paper "PhysiMeSS - A New PhysiCell Addon for Extracellular Matrix Modelling" is a useful extension to the popular simulation framework PhysiCell. It enables the simulation of cell populations interacting with the extracellular matrix, which is represented by a set of line segments (2D) or cylinders (3D). These represend a new kind of agent in the simulation framework. The paper outlines the basic implementation, properties and interactions of these agents. I recommend publication after a small set of minor issues have been addressed. Please refer to the attached marked-up PDF file for these minor issues and suggestions. https://gigabyte-review.rivervalleytechnologies.comdownload-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvVFIvNTUwL2d4LVRSLTE3MTk5NDYwNjlfU1kucGRm
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.10.27.564365v1
academic.oup.com academic.oup.com

Deepdefense: annotation of immune systems in prokaryotes using deep learning

1
1. GigaScience 13 Oct 2024
  
  in Gigascience Annotations
  
  The data used for training are publicly available as part of the publication [26]. The used HMM models are part of the publication [27]. Additionally, data were taken from the PADLOC website [41].
  
  DOME-ML annotations are also available for scruitiny https://dome.ds-wizard.org/projects/8f35140d-3b02-4328-ac18-c6bd9f8620e4
Visit annotations in context

Annotators

GigaScience

URL

academic.oup.com/gigascience/article/doi/10.1093/gigascience/giae062/7817746
academic.oup.com academic.oup.com

3D-Beacons: decreasing the gap between protein sequences and structures through a federated network of protein structure data resources

2
1. GigaScience 11 Oct 2024
  
  in Gigascience Annotations
  
  John Jumper,
  
  Joint winner of the 2024 Nobel Prize in Chemistry https://x.com/NobelPrize/status/1843951197960777760
2. GigaScience 11 Oct 2024
  
  in Gigascience Annotations
  
  Demis Hassabis,
  
  Joint winner of the 2024 Nobel Prize in Chemistry https://x.com/NobelPrize/status/1843951197960777760
Visit annotations in context

Annotators

GigaScience

URL

academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac118/6854872
Sep 2024
www.biorxiv.org www.biorxiv.org

High-speed whole-genome sequencing of a Whippet: Rapid chromosome-level assembly and annotation of an extremely fast dog’s genome

2
1. GigaScience 27 Sep 2024
  
  in GigaByte
  
  Editors Assessment:
  
  This Data Release paper presents the genome of the whippet breed of dog. Demonstrating a streamlined laboratory and bioinformatics workflows with PacBio HiFi long-read whole-genome sequencing that enables the generation of a high-quality reference genome within one week. The genome study being a collaboration between an academic biodiversity institute and a medical diagnostic company. The presented method of working and workflow providing examples that can be used for a wide range of future human and non-human genome projects. The final is 2.47 Gbp assembly being of high quality - with a contig N50 of 55 Mbp and a scaffold N50 of 65.7 Mbp. This reference being scaffolded into 39 chromosome-length scaffolds and the annotation resulting in 28,383 transcripts. The results also looked at the Myostatin gene which can be used for breeding purposes, as these heterozygous animals can have an advantage in dog races. The reviewers making the authors clarify this part a little better with additional results. Overall this study demonstrating how rapidly animal genome research can be carried out through close and streamlined time management and collaboration.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 27 Sep 2024
  
  in GigaByte
  
  AbstractBackground The time required for sequencing and de novo assembly of genomes is highly dependent on the interaction between laboratory work, sequencing capacity, and the bioinformatics workflow. As a result, genome projects are often not only limited by financial, computational and sequencing platform resources, but also delayed by second party sequencing service providers. By bringing together academic biodiversity institutes and a medical diagnostics company with extensive sequencing capabilities and know-how, we aimed at generating a high-quality mammalian de novo genome in the shortest possible time period. Therefore, we streamlined all processes involved and chose a very fast dog as a model: The Whippet.Findings We present the first chromosome-level genome assembly of the Whippet. We used PacBio long-read HiFi sequencing and reference-guided scaffolding to generate a high-quality genome assembly. The final assembly has a contig N50 of 55 Mbp and a scaffold N50 of 65.7 Mbp. The total assembly length is 2.47 Gbp, of which 2.43 Gpb were scaffolded into 39 chromosome-length scaffolds. In addition, we used available mammalian genomes and transcriptome data to annotate the genome assembly. The annotation resulted in 28,383 transcripts resembling a total of 90.9% complete BUSCO genes and identified a repeat content of 36.5%.Conclusions Sequencing, assembling, and scaffolding the chromosome-level genome of the Whippet took less than a week and adds a high-quality reference genome to the list of domestic dog breeds sequenced to date.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.134), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1. Tianming Lan
  
  The authors provided an example of High-speed strategy for whole-genome sequencing, genome assembly and annotation for species and take an example with the Whippet dog. This is a very novel idea under the genomic era with plummeting sequencing cost, fast accumulated sequencing data but shortage of computing resources. The authors also provide a very high-quality reference genome for the Whippet dog species with very good contiguity, accuracy and completeness. However, I have several concerns need the authors to further consider before it could be published at the journal of GigaByte.
  
  Q1. There are too many keywords. Can the authors reduce a few? Biodiversity conservation, Comparative genomics, and evolutionary biology does not make sense in this manuscript. Q2. The authors performed reference-guided scaffolding analysis with the German Shepherd dog genome (GCA_011100685.1) as reference. Better if the authors explain why they selected this genome as the reference as there are several published dog genomes? Q3.The part of Heterozygosity make no sense to this manuscript unless there is a reasonable connection with other parts, because the dog is not a threatened species and also not a very special breed facing extensive inbreeding abd accumulation of deleterious mutations? Q4. The part of Myostatin doesn’t make sense to me, as I have read the paper the author cited and found that not all Whippet have this mutation? They sequenced 22 individuals, and 4 individuals are homozygous (-/-), 5 are heterozygous (mh/+) and the rest are homozygous (+/+). So you can always have a result by checking this mutation, but make no sense. Furthermore, one individual can hardly represent a species or a population? At the beginning of this paragraph, please change “Since” to “Since”. Q5. I think the most important find in this manuscript is how the authors finished a high-quality genome within a very short-term working. I suggest the authors remove the descriptions of Heterozygosity and Myostatin, but added a paragraph to tell readers the basic needs or standards for such a short-term work for genome assembly for a genome of something like dog. Just a suggestion, but I think would be better to improve the manuscript.
  
  Reviewer 2. Xiaobo Wang
  
  This study outlines an approach to expedite the sequencing and de novo assembly of genomes by leveraging collaboration between academic biodiversity institutes and a medical diagnostics company with advanced sequencing capabilities. The primary focus was on generating a high-quality de novo genome of the Whippet, a fast dog breed, within an accelerated timeframe. Below are some specific comments I would like to highlight.
  
  The authors mentioned the use of QUAST and QualiMap software tools to assess the genome of the Whippet; however, the corresponding results were not presented in the manuscript.
  
  The authors' reliance solely on mammalian protein sequences for homology annotation means that unique genes specific to the Whippet remain unannotated. The discrepancy of approximately 7% between the completeness assessments of the gene set and the genome via BUSCO further underscores the incomplete nature of the gene set. To address this, I recommend integrating transcriptome data, at the very least, to incorporate de novo annotation results. This addition should enhance the comprehensiveness and accuracy of gene annotations for the Whippet genome.
  
  The authors claim the absence of reported mutations in the Mstn gene but have not provided corroborating evidence, such as read alignment results from the genomic region, to verify that this is not due to assembly errors.
  
  If feasible, I propose integrating second-generation sequencing to further polish the genome and elevate its quality.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.08.16.608262v1
www.biorxiv.org www.biorxiv.org

RiboSnake – a user-friendly, robust, reproducible, multipurpose and documentation-extensive pipeline for 16S rRNA gene microbiome analysis

2
1. GigaScience 04 Sep 2024
  
  in GigaByte
  
  Editors Assessment:
  
  This new software paper presents RiboSnake, a validated, automated, reproducible analysis pipeline implemented in the popular Snakemake workflow management system for microbiome analysis. Analysing16S rRNA gene amplicon sequencing data, this uses the widely used oQIIME2 [ tool as the basis of the workflow as it offers a wide range of functionality. Users of QIIME2 can be overwhelmed by the number of options at their disposal, and this workflow provides a fully automated and fully reproducible pipeline that can be easily installed and maintained. Providing an easy-to-navigate output accessible to non bioinformatics experts, alongside sets of already validated parameters for different types of samples. Reviewers requested some clarification for testing, worked examples and documentation, and this was improved to produce a convincingly easy-to-use workflow. Hopefully opening up an already very established technique to a new group of users and assisting them with reproducible science.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 04 Sep 2024
  
  in GigaByte
  
  AbstractBackground Next-generation sequencing for assaying microbial communities has become a standard technique in recent years. However, the initial investment required into in-silico analytics is still quite significant, especially for facilities not focused on bioinformatics. With the rapid decline in costs and growing adoption of sequencing-based methods in a number of fields, validated, fully automated, reproducible and yet flexible pipelines will play a greater role in various scientific fields in the future.Results We present RiboSnake, a validated, automated, reproducible QIIME2-based analysis pipeline implemented in Snakemake for the computational analysis of 16S rRNA gene amplicon sequencing data. The pipeline comes with pre-packaged validated parameter sets, optimized for different sample types. The sets range from complex environmental samples to patient data. The configuration packages can be easily adapted and shared, requiring minimal user input.Conclusion RiboSnake is a new alternative for researchers employing 16S rRNA gene amplicon sequencing and looking for a customizable and yet user-friendly pipeline for microbiome analysis with in-vitro validated settings. The complete analysis generated with a fully automated pipeline based on validated parameter sets for different sample types is a significant improvement to existing methods. The workflow repository can be found on GitHub (https://github.com/IKIM-Essen/RiboSnake).
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.132), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1. Michael Hall
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?
  
  Unable to test. The README states "If you want to test the RiboSnake functions yourself, you can use the same data used for the CI/CD tests." A worked example of how I can do this would be appreciated so I can test the workflow.
  
  Is there enough clear information in the documentation to install, run and test this tool, including information on where to seek help if required?
  
  The Usage instructions say to create a new repository using ribosnake as a template, but ribosnake is not a template repository (see https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-repository-from-a-template). The README states "If you want to test the RiboSnake functions yourself, you can use the same data used for the CI/CD tests." A worked example of how I can do this would be appreciated so I can test the workflow.
  
  Have any claims of performance been sufficiently tested and compared to other commonly-used packages?
  
  Not applicable.
  
  Is automated testing used or are there manual steps described so that the functionality of the software can be verified?
  
  Yes, though as mentioned above, the README states "If you want to test the RiboSnake functions yourself, you can use the same data used for the CI/CD tests." A worked example of how I can do this would be appreciated so I can test the workflow.
  
  Additional Comments:
  
  The Introduction could be make far more concise, there's a lot of repetition.
  
  The installation command in figure 1 is three commands, not two as stated in the text (third-last paragraph Introduction), and is slightly misleading from an installation point of view as it assumes conda and snakemake are installed. Though it is mentioned later in the text (p5) that snakemake and conda require manual installation.
  
  The in-text citation for Greengenes2 is just [?] - maybe a latex issue?
  
  The last paragraph of the 'Features and Implementations' section was mostly already stated earlier in the manuscript.
  
  Make the colouring consistent between fig 2a-c and 2d as well as the vertical ordering to make for easier comparison. For example, in figures 2a-c Enterococcus (grey) is on the bottom, whereas in fig 2d it is red and in the middle. Colour legends should also be added to Figures 3-5 to match Fig 2.
  
  A small table should be added showing the comparison of RiboSnake and the original publication for the top 10 most abundant phyla for the Atacama soil dataset and their abundances (see last paragraph of 'Usage and Findings'.
  
  Reviewer 2. Yong-Xin Liu and Salsabeel Yousuf
  
  The manuscript presented by the authors describes a comprehensive study on the “RiboSnake pipeline” for 16S rRNA gene microbiome analysis, which is a user-friendly, robust, and multipurpose. RiboSnake, a validated, automated, reproducible QIIME2-based analysis pipeline implemented in Snakemake, offers parallel processing for efficient analysis of large datasets in both environmental and medical research contexts. Further demonstrating its effectiveness, this pipeline effectively analyzes human-associated microbiomes and environmental samples like wastewater and soil, thus expanding the scope of analysis for 16S rRNA data. The overall computational pipeline is useful and results are sound, validated through rigorous testing on MOCK communities and real-world datasets. However, there are some issues for improvement in the manuscript.
  
  Major comments: 1． In the clinical data section the author mentions rectal swabs were used from a published study [31]. While the source is referenced, it would be helpful to know if any information was provided in the referenced study regarding the collection methods or storage conditions for the rectal swabs. 2． The text mentions using cotton swabs pre-moistened with TE buffer + 0.5% Tween 20. While cotton swabs are common, are there any considerations for using different swab materials depending on the target analytes or sampling surface (e.g., flocked swabs for better epithelial cell collection)? 3． Does RiboSnake require user intervention during any steps, or is it fully automated? 4． The author mentions that contamination filtering parameters should be adjusted based on the sample type. How can users determine the appropriate filtering parameters for their specific samples? Are there guidelines for users to know how much adjustment is needed for specific scenarios? 5． The default abundance threshold for filtering low-frequency reads is chosen based on Nearing et al. [44]. Please discuss the rationale behind using a single threshold for all sample types? Would it be beneficial to allow users to define this threshold based on their data characteristics? 6． Would you like to explain the limitation of RiboSnake, such as specific types of samples it may not be suitable for or potential biases introduced by certain functionalities? 7． The manuscript mentions various visualization tools used throughout the pipeline (QIIME2, qurro). Please clarify which types of data are visualized with each tool, and how users can access or customize these visualizations? 8． To strengthen the manuscript's impact, consider discussing the specific novelty of RiboSnake compared to existing 16S rRNA gene microbiome analysis pipelines. Would you be able to elaborate on the unique features or functionalities of RiboSnake that address limitations of current methods? 9． EasyAmplicon is recently published pipeline and easy using in windows, mac and linux system,
  
  Minor comments: 1. Reference is missing in this sentence. “The default is the SILVA database [47]. Greengenes2 [? ] can be used alternatively”. 2. The author should careful about the lowercase and upper case throughout the manuscript. Please check the following for references:  ..the 2017 published Atacama Soil data set with samples taken fromthe Atacama desert was used [32] as well as samples collected fromsoil under switchgrass published in [33].  based on an Euclidean beta diversitymetric, shows that the positive controls, as well as the samples taken from subjects 1 and 3 (S1 and S3), cluster together.  A wide range of diversity analysis parameters are available in QIIME2 and its associated tools. These include the Shannon diversity index to measure richness, the Pielou index tomeasure evenness, or perform standard correlation analysis using Pearson or Spearman indices, among others. 3. In the introduction part this sentences “However, while these methods enable 16S rRNA analysis with minimal user interaction…” needs attention for clarity. Consider separating it into two sentences to emphasize the limitations of existing pipelines compared to the described methods’. Alternatively, using contrasting words like "in contrast" could highlight these differences. 4. More detail in attached PDF.
  
  https://gigabyte-review.rivervalleytechnologies.comdownload-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvVFIvNTM5L2d4LVRSLTE3MTY5Nzk4MTktcmV2aXNlZC5wZGY=
  
  Re-review: The author's response has been fully addressed my concerns. The quality of the paper has apparently improved. I agree with the publication of this article.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.08.06.606757v1
www.biorxiv.org www.biorxiv.org

TooManyCellsInteractive: a visualization tool for dynamic exploration of single-cell data

3
1. GigaScience 03 Sep 2024
  
  in GigaScience
  
  AbstractAs single-cell sequencing data sets grow in size, visualizations of large cellular populations become difficult to parse and require extensive processing to identify subpopulations of cells. Managing many of these charts is laborious for technical users and unintuitive for non-technical users. To address this issue, we developed TooManyCellsInteractive (TMCI), a browser-based JavaScript application for visualizing hierarchical cellular populations as an interactive radial tree. TMCI allows users to explore, filter, and manipulate hierarchical data structures through an intuitive interface while also enabling batch export of high-quality custom graphics. Here we describe the software architecture and illustrate how TMCI has identified unique survival pathways among drug-tolerant persister cells in a pan-cancer analysis. TMCI will help guide increasingly large data visualizations and facilitate multi-resolution data exploration in a user-friendly way.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae056), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:
  
  Reviewer 3: Georgios Fotakis
  
  1) General comments In this manuscript the authors present TooManyCellsInteractive (TMCI), a browser-based TypeScript graphical user interface for the visualization and interactive exploration of single-cell data. TMCI facilitates the visualization of single-cell data by representing it as a radial tree of nested cell clusters. It relies on TooManyCells, a suite of tools designed for multi-resolution and multifaceted exploration of single-cell clades based on a matrix-free divisive hierarchical spectral clustering method. A key advantage of TCMI lies in its capability to provide a quantitative depiction of relationships among clusters, allowing for the delineation of context-dependent rare and abundant cell populations, as showcased in the original publication [1] and in the present manuscript. TMCI extends the capabilities of TMC significantly, notably enhancing computational performance, particularly in scenarios where multiple features are overlaid (an improvement that is attributed to the persistent feature of the PostgreSQL database).
  
  A notable aspect of this manuscript is the fact that the authors performed a benchmark using publicly available scRNAseq datasets. This benchmark highlights TMCI's superior performance over TMC and its comparable performance to two other commonly utilized tools (Cirrocumulus and CELLxGENE). Moreover, the authors showcase TMCI's applicability through aggregating publicly available scRNAseq data. Here, they successfully delineate sub-populations of cancer drug-tolerant persister cells by employing minimum distance search pruning, enhancing the visibility of small sub-populations. Additionally, the authors note an increase in ID2 gene expression among persister-cell populations, as well as the enrichment of unique biological programs between short- and long-term persister-cell populations. Furthermore, they observe an upregulation of the diapause gene signature across all treated sub-populations. The biological insights the authors glean are novel and highly intriguing. In general, this manuscript is well written, with the authors offering comprehensive documentation that covers the essential steps for installing and running TMCI through their GitHub repository. Additionally, they provide a minimal dataset as an example for users. However, there are a few minor adjustments that, once implemented, would enhance the manuscript's value by improving clarity and providing valuable insights to the field.
  
  2) Specific comments for revision a) Major - As stated in the manuscript's abstract, visualising large cell populations from single-cell atlases poses greater challenges and demands compute-intensive processes. One of my major concerns revolves around TMCI's scalability when handling large datasets. The authors conducted benchmarking on relatively modest datasets (ranging from 18,859 to 54,220 cells). Based on the data provided in Supplementary Table S3, while TMCI demonstrates comparable performance to CELLxGENE on the Tabula Muris dataset and its subset (with mean memory consumption differences ranging from 870 MB to 1.8 GB), the disparity significantly increases when loading and rendering visualizations of the larger dataset, reaching 8.5 GB of RAM. It would be of great interest if the authors conducted a similar benchmark using a larger dataset to elucidate how TMCI scales with increased cell numbers, especially considering the trend in the field towards single-cell atlases and the availability of datasets consisting up to millions of cells (like the Tabula Sapiens [2] dataset or similar [3, 4]).
  
  In the "Results" section, under the title "TMCI identifies sub-populations with highly expressed diapause programs," the authors assert that "the significantly different sub-populations were more easily seen in TMCI's tree". Since perception can be subjective (for instance, a user more accustomed to UMAP plots may find it challenging to interpret a tree representation), it would be beneficial for the authors to allocate a section of the supplementary material to demonstrate the clarity advantages of TMCI's tree visualization. One approach could involve a side-by-side comparison of visualizations generated by TMCI and CELLxGENE using the same color scheme. For instance, Figure 4b could be compared with Supplementary Figure S1g, Figure 4j with Supplementary Figure S1h, and so forth.
  
  The "Discussion" section overlooks the future prospects of TMCI. As demonstrated in the case study, TMCI exhibits potential beyond serving as a visualization tool for identifying tree-based relationships in single-cell data. Are there any plans for integrating analytical functionalities to provide insights into cellular compositions and underlying biology, such as marker gene identification, differential gene expression analysis, and gene set enrichment analysis? In the future, could TMCI support the visualization of such results using methods like violin plots, heatmaps, and others?
  
  In the "Materials and Methods'' section, the authors outline the process of aggregating the scRNAseq datasets used for the case study, including filtering and normalization steps. However, scRNAseq technologies are prone to significant noise resulting from amplification and dropout events. Additionally, when integrating different scRNAseq datasets, users need to consider potential batch effects. Did the authors employ any de-noising or batch correction methods? If not, what was the rationale behind this decision? It would be intriguing to observe any potential differences in the results following the application of such methods.
  
  Remaining within the "Materials and Methods" section, providing a brief description of the methods and tools utilized for the differential gene expression analysis, the GSEA (if not solely conducted through Metascape), and the packages utilized to generate the plots in Figures 3 and 4 would enhance clarity and facilitate reproducibility.
  
  Figure 4 - b: Distinguishing between the various cell lines on the partitioned nodes based on the current color coding—particularly for the MDA-MB-231 and PC9 cell lines, as well as between the treated and untreated populations of the SK-MEL-28 cell line—is quite challenging. Employing a different color scheme would significantly enhance clarity, making the different cell populations more distinguishable.
  
  Figure 4 - d and k: The authors should add statistics as relying solely on the box and whisker plots makes it challenging to ascertain whether there is a significant difference between the conditions. For instance, it appears that ID2 is over-expressed between the control and treated population only in the SK-MEL-28 cell line.
  
  b) Minor - In the "Results" section, under the title "TMCI reduces time to display trees," the authors state: "these benchmarks indicate not only the superior performance of TMCI to generate static and interactive tree of single-cell data compared to other tools…". However, based on the results presented in the manuscript and the supplementary material, it seems that TMCI may not be outperforming alternative interactive visualization methods. This phrase should be revised to accurately reflect the benchmark results.
  
  References 1. Schwartz GW, Zhou Y, Petrovic J, Fasolino M, et al. TooManyCells identifies and visualizes relationships of single-cell clades. Nat Methods 2020;17(4):405-413. PMID: 32123397 2. The Tabula Sapiens Consortium, The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science 2022;376, eabl4896. DOI:10.1126/science.abl4896 3. Sikkema L, Ramírez-Suástegui C, Strobl DC, et al. An integrated cell atlas of the lung in health and disease. Nat Med 2023;29, 1563-1577. DOI:10.1038/s41591-023-02327-2 4. Salcher S, Sturm G, Horvath L, et al. High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. Cancer cell 2022;40(12):1503-1520.E8. DOI:10.1016/j.ccell.2022.10.008
2. GigaScience 03 Sep 2024
  
  in GigaScience
  
  AbstractAs single-cell sequencing data sets grow in size, visualizations of large cellular populations become difficult to parse and require extensive processing to identify subpopulations of cells. Managing many of these charts is laborious for technical users and unintuitive for non-technical users. To address this issue, we developed TooManyCellsInteractive (TMCI), a browser-based JavaScript application for visualizing hierarchical cellular populations as an interactive radial tree. TMCI allows users to explore, filter, and manipulate hierarchical data structures through an intuitive interface while also enabling batch export of high-quality custom graphics. Here we describe the software architecture and illustrate how TMCI has identified unique survival pathways among drug-tolerant persister cells in a pan-cancer analysis. TMCI will help guide increasingly large data visualizations and facilitate multi-resolution data exploration in a user-friendly way.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae056), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:
  
  Reviewer 2: Mehmet Tekman
  
  PAPER: TOOMANYCELLSINTERACTIVE REVIEW
  
  Table of Contents
  
  Using the Application .. 1. Positive Notes: ..... 1. General UI and Execution .. 2. Negative Notes: ..... 1. Controls ..... 2. Documentation ..... 3. Feature Overlays:
  
  Docker / Postgreseql
  
  Ethos of the Introduction
  
  The manuscript reads very well, and the quality of the language is good.
  
  This review tests the application itself, and makes some comment about some ambiguous wording in the introduction
  
  1 Using the Application
  
  I tested the Interactive Display at https://tmci.schwartzlab.ca/
  
  1.1 Positive Notes: ~~~~~~~~~~~~~~~~~~~
  
  1.1.1 General UI and Execution
  
  The general interactivity of the UI was very impressive and expressive. I liked that every aspect including the pies and the lines themselves could be coloured and scaled.
  
  I found the feature overlays and pruning history stack very intuitive, as well as rolling back the history on each state change.
  
  The choice of D3 was a good one, enabling very pleasing animations enter/exit/update state animations, as well as ease of SVG export.
  
  The inclusion of a command line `generate-svg.sh' for rendering without a browser is very useful.
  
  1.2 Negative Notes: ~~~~~~~~~~~~~~~~~~~
  
  1.2.1 Controls
  
  At first I wasn't able to find the controls, despite having the page open to 1330px wide, but then I realised I had to scroll down outside of the SVG container to find them.
  
  As mentioned in a recently opened PR, there's a CSS media rule `@media only screen and (min-width:1238px)' taking place, that looks strange on my Firefox 122 on Linux. Maybe better media rules for screens in the 700-900px wide range might be useful, as well as making separate rules for smartphones.
  
  1.2.2 Documentation
  
  Typescript is a good language to develop in, and lends itself naturally to documentation, though I did notice a distinct lack of documentation above many functions in the code base.
  
  Perhaps write a bit more documentation to make the code base accessible to new collaborators?
  
  Otherwise, the quality of code looked good, and the license was GPLv3 which is always welcome.
  
  1.2.3 Feature Overlays:
  
  I found the feature overlays super useful, though limited by the number of colours. These appear to be limited to one colour for all genes.
  
  Very useful for showing multiple genes, but it would be nice to have the ability to colour the expression of different genes with different colours, at least for < 3 genes of interest (due to the difficult colour mixing constraints).
  
  2 Docker / Postgreseql
  
  It is not clear to me what the Node server and PostgresQL database run in the docker container are actually doing, other than fetching cell metadata and marking user subsets from pruning actions.
  
  Could this not have been implemented in Javascript (e.g. IndexedDB)? Why does the data need to be hosted, if it's the user loading it from their own machine anyway. Is the idea that the visualization should be shared by multiple users who will be accessing the same dataset?
  
  If this is a single-user analysis, then why not keep all the computation and retrieval on the client-side?
  
  The reason I'm asking this is because I believe that by keeping the database operations within Javascript, you could run the system within a single Conda environment, or even with pure Node lockfile.
  
  I can understand needing a Docker for development purposes, but to actually run the software itself seems excessive. Is it not possible to separate the client and server into Conda? That way, one could then include the vizualisation (as the end stage) in bioinformatic pipelines.
  
  3 Ethos of the Introduction
  
  This is a small wording complaint in the Introduction section.
  
  TooManyCellsInteractive (TMCI) presents itself as a solution to the conventional scRNA-seq workflows that prepare the data via the usual: data → PCA → UMAP→ kNN → clustering stages.
  
  TMCI hints that it as an alternative solution to this workflow, but from what I can see in the documentation, it appears to require a cluster_tree.json' file, one that is produced only by the TooManyCells (TMC) pipeline.
  
  Unless I've misunderstood, it's not accurate to say that TMCI is an alternative to these conventional workflows, but that TMC is.
  
  TMCI simply consumes the files output by TMC and renders them. If what I'm saying is true, then the introduction should reflect that.
3. GigaScience 03 Sep 2024
  
  in GigaScience
  
  AbstractAs single-cell sequencing data sets grow in size, visualizations of large cellular populations become difficult to parse and require extensive processing to identify subpopulations of cells. Managing many of these charts is laborious for technical users and unintuitive for non-technical users. To address this issue, we developed TooManyCellsInteractive (TMCI), a browser-based JavaScript application for visualizing hierarchical cellular populations as an interactive radial tree. TMCI allows users to explore, filter, and manipulate hierarchical data structures through an intuitive interface while also enabling batch export of high-quality custom graphics. Here we describe the software architecture and illustrate how TMCI has identified unique survival pathways among drug-tolerant persister cells in a pan-cancer analysis. TMCI will help guide increasingly large data visualizations and facilitate multi-resolution data exploration in a user-friendly way.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae056), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:
  
  Reviewer 1: Qingnan Liang
  
  Klamann et al. report a tool for single-cell data visualization, TMCI, which was related to the previous method of TMC. It is appreciated to see such continuous work and maintenance of the method and I do agree TMCI has the potential of promoting the application of TMC. The manuscript is generally well-written, and it suits well with the scope of GigaScience. The TMCI is publicly available with reasonably detailed tutorials. In this manuscript, however, at several points the elaboration does not provide sufficient details or rationales. I suggest revision/clarification as below before recommendation to publish.
  
  Does TMCI provide an interface with one or more popular single-cell frameworks, such as SingelCellExperiment, Seurat, or Scanpy? A TMCI user would probably use one of these frameworks to do other parts of the analysis.
  
  Is batch effect considered in the drug-treated data example? More generally, if a user want to use TMCI with multiple datasets, what would be the recommended approach for batch effect? Also, we know cell cycle is a factor that are usually 'regressed out' for single-cell analysis. Does TMC/TMCI consider this?
  
  "To normalize cells between data sets, we used term frequency-inverse document frequency to weigh genes such that more frequent genes across cells had less impact on downstream clustering analyses" We know TF-IDF is becoming a common practice in scATAC-seq analysis. Is this TF-IDF approach common for tree construction (or hierarchical clustering) with high dimensional data? Is this recommended for all users with scRNA-seq data?
  
  Figure 4C is not very easy to read. It may be helpful to label/highlight the comparison pairs to make the point.
  
  Also it is not sufficiently emphasized that how TMCI helped finding this ID2 target. Or how such visualization would trigger interesting downstream approaches. I guess the power of this tree approach is somehow similar to the increasingly popular 'metacell' approach, which combine similar cells to 'cell states'. Thus it makes an interesting midpoint between 'single-cell' and 'pseudo-bulk'. It would really be helpful to see that some states (nodes), although similarly treated, behave differently than others, if there are such cases (not sure if cell lines have such heterogeneity). Similar comments for the pathway analysis part.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.06.16.544954v1
Aug 2024
www.biorxiv.org www.biorxiv.org

MOBFinder: a tool for MOB typing for plasmid metagenomic fragments based on language model

2
1. GigaScience 12 Aug 2024
  
  in GigaScience
  
  AbstractBackground MOB typing is a classification scheme that classifies plasmid genomes based on their relaxase gene. The host range of plasmids of different MOB categories are diverse and MOB typing is crucial for investigating the mobilization of plasmid, especially the transmission of resistance genes and virulence factors. However, MOB typing of plasmid metagenomic data is challenging due to the highly fragmented characteristic of metagenomic contigs.Results We developed MOBFinder, an 11-class classifier to classify the plasmid fragments into 10 MOB categories and a non-mobilizable category. We first performed the MOB typing for classifying complete plasmid genomes using the relaxes information, and constructed the artificial benchmark plasmid metagenomic fragments from these complete plasmid genomes whose MOB types are well annotated. Based on natural language models, we used the word vector to characterize the plasmid fragments. Several random forest classification models were trained and integrated for predicting plasmid fragments with different lengths. Evaluating the tool over the benchmark dataset, MOBFinder demonstrates higher performance compared to the existing tool, with an overall accuracy of approximately 59% higher than the MOB-suite. Moreover, the balanced accuracy, harmonic mean and F1-score could reach 99% in some MOB types. In an application focused on a T2D cohort, MOBFinder offered insights suggesting that the MOBF type might accelerate the antibiotic resistance transmission in patients suffering from T2D.Conclusions To the best of our knowledge, MOBFinder is the first tool for MOB tying for plasmid metagenomic fragments. MOBFinder is freely available at https://github.com/FengTaoSMU/MOBFinder.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae047), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:
  
  **Reviewer 1: Haruo Suzuki **
  
  I recommend that the authors consider revising based on the following points.
  
  the unpaired Wilcoxon signed-rank two-sided test. -> should be corrected to either "Wilcoxon rank-sum test" or "Mann-Whitney U test"
  
  https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test "Wilcoxon rank-sum test" redirects here. For Wilcoxon signed-rank test, see Wilcoxon signed-rank test. https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test Not to be confused with Wilcoxon rank-sum test.
  
  Since MOBscan can only predict the MOB type with plasmid proteins, we annotated the plasmids in the test set with Prokka, then manually submitted them to the MOBscan website for MOB type annotation.
  
  Given that MOBScan operates as an online tool and cannot be executed locally, the calculation of MOBScan's run time was confined to the duration spent on preprocessing with Prokka locally." (Please refer to Line 313-319 in the revised manuscript).
  
  -> Actually, it can be executed locally using the scripts included in https://github.com/santirdnd/COPLA/. It may not be necessary to run MOBscan locally (it may be okay that they manually submitted them to the MOBscan website), but I'll inform you regardless.
  
  In the comparison, it was observed that MOBscan did not perform well, achieving low accuracy and kappa values across sequences of varying lengths, while MOB-suite exhibited marginally better performance than MOBscan when handling sequences of greater length (Figure 3A, 3B). (Please refer to Line 418-421 in the revised manuscript).
  
  -> Do the authors' results contradict the following general expectation? MOB-typer utilizes BLAST, whereas MOBscan utilizes hmmscan, and therefore, MOBscan is expected to retrieve more distantly related proteins than MOB-typer.
  
  MOB-suit and MOBscan are represented by blue lines, orange lines and gray lines respectively. -> should be "MOB-suite"
  
  I suggest receiving English language editing before publishing the paper. "For the MOB typing, MOBscan [18] uses the HMMER model to annotated the relaxases and further perform MOB typing." -> should be "For the MOB typing, MOBscan [18] uses the HMMER model to annotate the relaxases and further perform MOB typing."
2. GigaScience 12 Aug 2024
  
  in GigaScience
  
  AbstractBackground MOB typing is a classification scheme that classifies plasmid genomes based on their relaxase gene. The host range of plasmids of different MOB categories are diverse and MOB typing is crucial for investigating the mobilization of plasmid, especially the transmission of resistance genes and virulence factors. However, MOB typing of plasmid metagenomic data is challenging due to the highly fragmented characteristic of metagenomic contigs.Results We developed MOBFinder, an 11-class classifier to classify the plasmid fragments into 10 MOB categories and a non-mobilizable category. We first performed the MOB typing for classifying complete plasmid genomes using the relaxes information, and constructed the artificial benchmark plasmid metagenomic fragments from these complete plasmid genomes whose MOB types are well annotated. Based on natural language models, we used the word vector to characterize the plasmid fragments. Several random forest classification models were trained and integrated for predicting plasmid fragments with different lengths. Evaluating the tool over the benchmark dataset, MOBFinder demonstrates higher performance compared to the existing tool, with an overall accuracy of approximately 59% higher than the MOB-suite. Moreover, the balanced accuracy, harmonic mean and F1-score could reach 99% in some MOB types. In an application focused on a T2D cohort, MOBFinder offered insights suggesting that the MOBF type might accelerate the antibiotic resistance transmission in patients suffering from T2D.Conclusions To the best of our knowledge, MOBFinder is the first tool for MOB tying for plasmid metagenomic fragments. MOBFinder is freely available at https://github.com/FengTaoSMU/MOBFinder.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae047), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:
  
  Reviewer 2: Dan Wang
  
  The manuscript provides a comprehensive background on the necessity and challenges of MOB typing in the context of plasmid genomics and its significance in tracking the transmission of resistance genes and virulence factors. The innovation introduced by MOBFinder, which incorporates an 11-class classification system, addresses a critical gap in current research methodologies by enhancing the precision of plasmid fragment classification. Key Strengths: Innovation: MOBFinder represents a novel approach in the typing of metagenomic plasmid fragments using word vector characterization combined with machine learning techniques. Methodological Rigor: The methodological approach, including the use of random forest models and the construction of a benchmark dataset from annotated complete plasmid genomes, is robust and well-executed. Performance: The tool demonstrates superior performance compared to existing tools like MOBscan and MOB-suite, providing a significant improvement in accuracy. Impact on Field: The application of MOBFinder in a T2D cohort illustrates the tool's practical utility and its potential to influence antibiotic resistance studies. Recommendation: Given the thorough revisions and the contributions this manuscript offers to the field of microbial genomics and antibiotic resistance, I recommend that the manuscript be accepted for publication in GigaScience.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.12.06.570414v1
www.biorxiv.org www.biorxiv.org

CAT – A Computational Anatomy Toolbox for the Analysis of Structural MRI Data

3
1. GigaScience 12 Aug 2024
  
  in GigaScience
  
  AbstractA large range of sophisticated brain image analysis tools have been developed by the neuroscience community, greatly advancing the field of human brain mapping. Here we introduce the Computational Anatomy Toolbox (CAT) – a powerful suite of tools for brain morphometric analyses with an intuitive graphical user interface, but also usable as a shell script. CAT is suitable for beginners, casual users, experts, and developers alike providing a comprehensive set of analysis options, workflows, and integrated pipelines. The available analysis streams – illustrated on an example dataset – allow for voxel-based, surface-based, as well as region-based morphometric analyses. Notably, CAT incorporates multiple quality control options and covers the entire analysis workflow, including the preprocessing of cross-sectional and longitudinal data, statistical analysis, and the visualization of results. The overarching aim of this article is to provide a complete description and evaluation of CAT, while offering a citable standard for the neuroscience community.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae049), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:
  
  Reviewer 3: Cyril Pernet
  
  CAT has been around for a long time and is a well maintained toolbox - the paper describes all the features and additionally provides tests/validations of those features. I have left a few comments on the pdf (uploaded) which I don't see has mandatory and thus 'accepted' the paper (and leave the authors to decide what to do with those comments). It provides a nice reference for the toolbox.
2. GigaScience 12 Aug 2024
  
  in GigaScience
  
  AbstractA large range of sophisticated brain image analysis tools have been developed by the neuroscience community, greatly advancing the field of human brain mapping. Here we introduce the Computational Anatomy Toolbox (CAT) – a powerful suite of tools for brain morphometric analyses with an intuitive graphical user interface, but also usable as a shell script. CAT is suitable for beginners, casual users, experts, and developers alike providing a comprehensive set of analysis options, workflows, and integrated pipelines. The available analysis streams – illustrated on an example dataset – allow for voxel-based, surface-based, as well as region-based morphometric analyses. Notably, CAT incorporates multiple quality control options and covers the entire analysis workflow, including the preprocessing of cross-sectional and longitudinal data, statistical analysis, and the visualization of results. The overarching aim of this article is to provide a complete description and evaluation of CAT, while offering a citable standard for the neuroscience community.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae049), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:
  
  Reviewer 2: Chris Foulon
  
  Overall, I think the CAT software provides valuable tools to analyse morphometric differences in the brain and promotes open science. The study shows the software's capabilities rather well. However, I think some clarifications would help the readers understand and evaluate the quality of the methods.
  
  Comments: Figure 2: Looking at the chart, I have a question regarding the pipeline. Is it required to run the whole pipeline using CAT? Or is it possible to input already registered data to start directly with the VBM analysis or further?
  
  Voxel-based Processing: The above question is quite important, seeing that the preprocessing uses rather old registration methods. The users might want to use more recent registration methods, especially with clinical populations.
  
  Spatial Registration and Figure 3: For the registration, how is the registration performing with clinical populations (e.g. stroke patients)? It can be significant for the applicability of the methods with specific disorders.
  
  Surface Registration and Figure 3: What type of noise is used to evaluate the accuracy? This can be important as not every noise can be modelled easily, and some noises are more or less pronounced depending on the modality.
  
  Maybe having the letters of the figure panels referred to in the text would help the reader.
  
  Performance of CAT: Although I see the advantage of using simulated data, I think it would require more explanation. First, what tells the reader the quality of this simulated data, and how does it compare to real data? Second, is it only healthy data? In that case, the accuracy evaluation might not be relevant for the majority of the clinical studies using CAT.
  
  Longitudinal Processing: Are VBM analyses sensitive enough to capture changes over days? I would be surprised, but I would be interested to see studies doing it (and the readers would also benefit from it, I reckon).
  
  Mapping onto the Cortical Surface: I am a bit confused about the interest in mapping functional or diffusion parameters to the surface. Do you have examples of articles doing that? It sounds like it would waste a lot of information from these parameters, but I am not familiar with this type of analysis. "Optionally, CAT also allows mapping of voxel values at multiple positions along the surface normal at each node". I do not understand this sentence; I think it should be clarified.
  
  Example application: Is there a way to come back from the surface space to the volume space to compare the results? For example, VBM and SBM should provide fairly similar results, but comparing them is difficult when they are not in the same space. Additionally, in the end, the surface representation is just that, a representation; most other analyses are still done on the volume space, so it could be helpful to translate the result on the surface back to the volume (if it is not already available).
  
  Evaluation of CAT12: I was confused with Supplemental Figure 1 as it is not mentioned in the caption that it is the AD data and not the simulated one. Maybe it would help the reader to mention it.
  
  Regarding the reliability of CAT12, it seems to capture more things, but I struggle to see how we can be sure that this is "better" than other methods; couldn't it be false positives?
  
  "those achieved based on manual tracing and demonstrated that both approaches produced comparable hippocampal volume." comparable volumes do not really mean the same accuracy; this sentence could be misleading.
  
  I think the multiple studies show that CAT12 is as valid as any other tool but I am not sure the argument that it is better is as solid. Of course, I understand that there is no ground truth for what a relevant morphological change is for a given disease.
  
  Methods: Statistical Analysis: Why is the FWER correction used for the voxel-wise statistics (which perform many comparisons) and FDR used on ROI-wise statistics (which perform much fewer comparisons)? I would expect the opposite.
  
  "The outcomes of the VBM and voxel-based ROI analyses were overlaid onto orthogonal sections of the mean brain created from the entire study sample (n=50); " I don't understand what this refers to.
3. GigaScience 12 Aug 2024
  
  in GigaScience
  
  AbstractA large range of sophisticated brain image analysis tools have been developed by the neuroscience community, greatly advancing the field of human brain mapping. Here we introduce the Computational Anatomy Toolbox (CAT) – a powerful suite of tools for brain morphometric analyses with an intuitive graphical user interface, but also usable as a shell script. CAT is suitable for beginners, casual users, experts, and developers alike providing a comprehensive set of analysis options, workflows, and integrated pipelines. The available analysis streams – illustrated on an example dataset – allow for voxel-based, surface-based, as well as region-based morphometric analyses. Notably, CAT incorporates multiple quality control options and covers the entire analysis workflow, including the preprocessing of cross-sectional and longitudinal data, statistical analysis, and the visualization of results. The overarching aim of this article is to provide a complete description and evaluation of CAT, while offering a citable standard for the neuroscience community.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae049), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:
  
  Reviewer 1: Chris Armit
  
  This Technical Note describes the Computational Anatomy Toolbox (CAT) software tool, which includes a Graphical User Interface that can be used for morphometric analysis of Structural MRI data. The CAT software tool is impressive, and enables voxel-based and surface-based morphometric analysis to be accomplished on Structural MRI data, and also voxel-based tissue segmentation and surface mesh generation to be applied to these 3D imaging datasets. The authors helpfully illustrate the utility of the Computational Anatomy Toolbox (CAT) using T1-weighted structural brain images from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database.
  
  This is an excellent, freely available tool for the Neuroimaging community and the authors are to be commended for developing this impressive software tool.
  
  Minor comments
  
  I first attempted to launch the CAT software tool on macOS 14.0 (Sonoma) with Apple M1 chip, and on the command line I received the following message: "spm12" is damaged and can't be opened. You should move it to the Bin.
  
  I additionally tested the CAT software tool on macOS 12.6 (Monterey) with Intel chip, and I was able to run the CAT software tool on this platform.
  
  A minor criticism is that the installation instructions in the supporting Readme file for archive [CAT12.9_R2023b_MCR_Mac_arm64.zip], which runs on macOS with Intel chip, only details how to install the SPM (Statistical Parametric Mapping) software tool. The CAT software tool needs to be downloaded separately and then moved into the directory of the SPM toolbox, and these installation instructions are included in the supporting CAT software documentation (https://neuro-jena.github.io/cat12-help/#get_started)
  
  With the issues I encountered in installation, I invite the authors to list the System Requirements - specifically the Operating Systems that are needed to run the CAT software tool - in the GigaScience manuscript and also in the supporting CAT software documentation.
  
  In addition, it would be particularly helpful if the instructions on how to install CAT in the context of SPM were included in the supporting Readme files for the Computational Anatomy Toolbox (CAT) zip archives.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.06.11.495736v2
www.biorxiv.org www.biorxiv.org

Deciphering Cancer Genomes with GenomeSpy: A Grammar-Based Visualization Toolkit

3
1. GigaScience 12 Aug 2024
  
  in GigaScience
  
  AbstractBackground Visualization is an indispensable facet of genomic data analysis. Despite the abundance of specialized visualization tools, there remains a distinct need for tailored solutions. However, their implementation typically requires extensive programming expertise from bioinformaticians and software developers, especially when building interactive applications. Toolkits based on visualization grammars offer a more accessible, declarative way to author new visualizations. Nevertheless, current grammar-based solutions fall short in adequately supporting the interactive analysis of large data sets with extensive sample collections, a pivotal task often encountered in cancer research.Results We present GenomeSpy, a grammar-based toolkit for authoring tailored, interactive visualizations for genomic data analysis. Users can implement new visualization designs with little effort by using combinatorial building blocks that are put together with a declarative language. These fully customizable visualizations can be embedded in web pages or end-user-oriented applications. The toolkit also includes a fully customizable but user-friendly application for analyzing sample collections, which may comprise genomic and clinical data. Findings can be bookmarked and shared as links that incorporate provenance information. A distinctive element of GenomeSpy’s architecture is its effective use of the graphics processing unit (GPU) in all rendering. GPU usage enables a high frame rate and smoothly animated interactions, such as navigation within a genome. We demonstrate the utility of GenomeSpy by characterizing the genomic landscape of 753 ovarian cancer samples from patients in the DECIDER clinical trial. Our results expand the understanding of the genomic architecture in ovarian cancer, particularly the diversity of chromosomal instability. We also show how GenomeSpy enabled the discovery of clinically actionable genomic aberrations.Conclusions GenomeSpy is a visualization toolkit applicable to a wide range of tasks pertinent to genome analysis. It offers high flexibility and exceptional performance in interactive analysis. The toolkit is open source with an MIT license, implemented in JavaScript, and available at https://genomespy.app/.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae040), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:
  
  Reviewer 3: Luca Beltrame
  
  Lavikka and coworkers present an interesting visualization framework and associated application for genomics visualization. The challenges outlined by the authors in finding appropriate visualization tools for large-scale genomics data were also experienced by this reviewer, and thus better and improved tools are always welcome.
  
  The manuscript is well laid out, presenting the key facts in a proper manner. The use of GPU rendering for graphs is an excellent move, and I expect to be extremely useful even for machines with lower-end GPUs. The code looks reasonably written and commented (being an application, this too is important for a review). I have also tested the examples, and indeed the software is very useful (the documentation should, however, point out that some issues regarding saving the canvas still exist). One may argue that the use of JSON for the graph grammar can be awkward, but at the same time other file formats may be more problematic and/or require specialized parsers (which open yet another can of worms).
  
  Documentation is also logically organized. As a minor suggestion, the authors may want to add some form of search to their documentation page.
  
  There are is an open questions that the authors may want to answer: they explicitly mention GISTIC 1.0 for the G-score plots. Is there a specific reason why they chose 1.0? The 2.0 algorithm is far more robust and produces more reliable results.
2. GigaScience 12 Aug 2024
  
  in GigaScience
  
  AbstractBackground Visualization is an indispensable facet of genomic data analysis. Despite the abundance of specialized visualization tools, there remains a distinct need for tailored solutions. However, their implementation typically requires extensive programming expertise from bioinformaticians and software developers, especially when building interactive applications. Toolkits based on visualization grammars offer a more accessible, declarative way to author new visualizations. Nevertheless, current grammar-based solutions fall short in adequately supporting the interactive analysis of large data sets with extensive sample collections, a pivotal task often encountered in cancer research.Results We present GenomeSpy, a grammar-based toolkit for authoring tailored, interactive visualizations for genomic data analysis. Users can implement new visualization designs with little effort by using combinatorial building blocks that are put together with a declarative language. These fully customizable visualizations can be embedded in web pages or end-user-oriented applications. The toolkit also includes a fully customizable but user-friendly application for analyzing sample collections, which may comprise genomic and clinical data. Findings can be bookmarked and shared as links that incorporate provenance information. A distinctive element of GenomeSpy’s architecture is its effective use of the graphics processing unit (GPU) in all rendering. GPU usage enables a high frame rate and smoothly animated interactions, such as navigation within a genome. We demonstrate the utility of GenomeSpy by characterizing the genomic landscape of 753 ovarian cancer samples from patients in the DECIDER clinical trial. Our results expand the understanding of the genomic architecture in ovarian cancer, particularly the diversity of chromosomal instability. We also show how GenomeSpy enabled the discovery of clinically actionable genomic aberrations.Conclusions GenomeSpy is a visualization toolkit applicable to a wide range of tasks pertinent to genome analysis. It offers high flexibility and exceptional performance in interactive analysis. The toolkit is open source with an MIT license, implemented in JavaScript, and available at https://genomespy.app/.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae040), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:
  
  **Reviewer 2: Alessandro Romanel **
  
  In this article, the authors introduce GenomeSpy, a grammar-based toolkit for creating customized, interactive visualizations for genomic data analysis. I find the article extremely interesting, and I believe the framework introduced by the authors has broad utility. The website is well-maintained and documented, and I particularly found the examples mentioned in the paper to be useful and informative. The authors chose to present their toolkit by narrating the navigation of a dataset generated in the DECIDER study. While the narrative makes the utility of the visualizations clear in data interpretation, what is not clear at all is how easy it is to use GenomeSpy to create those same visualizations. I believe that the success of a toolkit like this is strongly tied to its ease of use, and this aspect is not clear or prominently highlighted in the manuscript. Additionally, it would be interesting to more clearly highlight GenomeSpy's strengths compared to other approaches. By combining Rshiny and ggplot, it is indeed possible to create complex interactive data visualizations. Therefore, it would be necessary to more strongly emphasize what the other innovative aspects of GenomeSpy are, beyond GPU acceleration, compared to other approaches available today.
3. GigaScience 12 Aug 2024
  
  in GigaScience
  
  AbstractBackground Visualization is an indispensable facet of genomic data analysis. Despite the abundance of specialized visualization tools, there remains a distinct need for tailored solutions. However, their implementation typically requires extensive programming expertise from bioinformaticians and software developers, especially when building interactive applications. Toolkits based on visualization grammars offer a more accessible, declarative way to author new visualizations. Nevertheless, current grammar-based solutions fall short in adequately supporting the interactive analysis of large data sets with extensive sample collections, a pivotal task often encountered in cancer research.Results We present GenomeSpy, a grammar-based toolkit for authoring tailored, interactive visualizations for genomic data analysis. Users can implement new visualization designs with little effort by using combinatorial building blocks that are put together with a declarative language. These fully customizable visualizations can be embedded in web pages or end-user-oriented applications. The toolkit also includes a fully customizable but user-friendly application for analyzing sample collections, which may comprise genomic and clinical data. Findings can be bookmarked and shared as links that incorporate provenance information. A distinctive element of GenomeSpy’s architecture is its effective use of the graphics processing unit (GPU) in all rendering. GPU usage enables a high frame rate and smoothly animated interactions, such as navigation within a genome. We demonstrate the utility of GenomeSpy by characterizing the genomic landscape of 753 ovarian cancer samples from patients in the DECIDER clinical trial. Our results expand the understanding of the genomic architecture in ovarian cancer, particularly the diversity of chromosomal instability. We also show how GenomeSpy enabled the discovery of clinically actionable genomic aberrations.Conclusions GenomeSpy is a visualization toolkit applicable to a wide range of tasks pertinent to genome analysis. It offers high flexibility and exceptional performance in interactive analysis. The toolkit is open source with an MIT license, implemented in JavaScript, and available at https://genomespy.app/.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae040), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:
  
  Reviewer 1: Andrea Sboner
  
  In this manuscript, the authors present Genome Spy, a visualization toolkit geared toward the rapid and interactive exploration of genomic features. They demonstrate how this tool can help investigators explore a large cohort of 753 ovarian cancers sequenced by whole-genome sequencing (WGS). By using the tool, they were able to identify outliers in the dataset and refine their diagnosis. The tool is inspired by Vega-lite, a high-level grammar for interactive graphics, and extends it for genomic applications.
  
  The manuscript is clearly written, and the authors provide links to the applications itself, tutorials and examples. I want to commend them for doing this. This is a tool that would nicely complement others and has a specific advantage of using high-performance GPUs that are now common in modern computers.
  
  The only concern that I have is about a couple of claims that may not be fully supported by the data provided: 1. Claim: users can implement new visualization designs easily. While the grammar certainly enables the users to define new designs, I do not think that this is necessarily easy, as the authors themselves recognize in the discussion section when they suggest providing templates to reduce the learning curve. Indeed, the example in Figure 2 is still quite verbose and would need some time for anyone to understand the syntax and the style. The playground web application facilitates testing it, though. 2. Claim: the grammar-based approach allows to be mixed and matched. I did not find any specific example of how to do this. It would have been quite interesting to see the intersection between the DNA representation of structural variants and RNA-seq data (if this is what it means as "mix and match").
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.10.06.561159v1
www.biorxiv.org www.biorxiv.org

Impact of reference design on estimating SARS-CoV-2 lineage abundances from wastewater sequencing data

2
1. GigaScience 12 Aug 2024
  
  in GigaScience
  
  AbstractBackground Sequencing of SARS-CoV-2 RNA from wastewater samples has emerged as a valuable tool for detecting the presence and relative abundances of SARS-CoV-2 variants in a community. By analyzing the viral genetic material present in wastewater, public health officials can gain early insights into the spread of the virus and inform timely intervention measures. The construction of reference datasets from known SARS-CoV-2 lineages and their mutation profies has become state-of-the-art for assigning viral lineages and their relative abundances from wastewater sequencing data. However, the selection of reference sequences or mutations directly affects the predictive power.Results Here, we show the impact of a mutation- and sequence-based reference reconstruction for SARS-CoV-2 abundance estimation. We benchmark three data sets: 1) synthetic “spike-in” mixtures, 2) German samples from early 2021, mainly comprising Alpha, and 3) samples obtained from wastewater at an international airport in Germany from the end of 2021, including 1rst signals of Omicron. The two approaches differ in sub-lineage detection, with the marker-mutation-based method, in particular, being challenged by the increasing number of mutations and lineages. However, the estimations of both approaches depend on selecting representative references and optimized parameter settings. By performing parameter escalation experiments, we demonstrate the effects of reference size and alternative allele frequency cutoffs for abundance estimation. We show how different parameter settings can lead to different results for our test data sets, and illustrate the effects of virus lineage composition of wastewater samples and references.Conclusions Here, we compare a mutation- and sequence-based reference construction and assignment for SARS-CoV-2 abundance estimation from wastewater samples. Our study highlights current computational challenges, focusing on the general reference design, which significantly and directly impacts abundance allocations. We illustrate advantages and disadvantages that may be relevant for further developments in the wastewater community and in the context of higher standardization.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae051), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:
  
  **Reviewer 2: Liuyang Zhao **
  
  In this study, the authors initiate a novel exploration by employing parameter escalation experiments to assess the impact of reference size and alternative allele frequency cutoffs on the effects of virus lineage composition in wastewater samples and their references. The research provides valuable insights into how different parameter settings influence outcomes in test data sets, particularly highlighting the role of virus lineage composition in wastewater samples and the corresponding references. Detailed parameters for these analyses are made available in several bash files at osf.io/upbqj. Despite these significant contributions, certain areas could benefit from further enhancement:
  
  1.The current methodology utilizes Ion Torrent for testing mock samples. However, this approach may not fully capture the variability in alignment and sub-lineage analysis. Incorporating additional sequencing data from PacBio, Nanopore, and Illumina would offer a more comprehensive examination of these aspects, potentially leading to more robust findings.
  
  2.While the study showcases a variety of pipelines based on mutation-based and sequence-based tools in Table 1, the evaluation of three data sets was limited to only using MAMUSS (as a mutation-based reference) and VLQ-nf (as a sequence-based reference). For more conclusive guidance in pipeline selection, it is advisable for the authors to expand their analysis to include at least two or three more pipelines. This recommendation aligns with observations noted by the authors at line 619, suggesting a comprehensive benchmark comparison would significantly enhance the study's utility and appeal to readers seeking optimal pipeline strategies.
2. GigaScience 12 Aug 2024
  
  in GigaScience
  
  AbstractBackground Sequencing of SARS-CoV-2 RNA from wastewater samples has emerged as a valuable tool for detecting the presence and relative abundances of SARS-CoV-2 variants in a community. By analyzing the viral genetic material present in wastewater, public health officials can gain early insights into the spread of the virus and inform timely intervention measures. The construction of reference datasets from known SARS-CoV-2 lineages and their mutation profies has become state-of-the-art for assigning viral lineages and their relative abundances from wastewater sequencing data. However, the selection of reference sequences or mutations directly affects the predictive power.Results Here, we show the impact of a mutation- and sequence-based reference reconstruction for SARS-CoV-2 abundance estimation. We benchmark three data sets: 1) synthetic “spike-in” mixtures, 2) German samples from early 2021, mainly comprising Alpha, and 3) samples obtained from wastewater at an international airport in Germany from the end of 2021, including 1rst signals of Omicron. The two approaches differ in sub-lineage detection, with the marker-mutation-based method, in particular, being challenged by the increasing number of mutations and lineages. However, the estimations of both approaches depend on selecting representative references and optimized parameter settings. By performing parameter escalation experiments, we demonstrate the effects of reference size and alternative allele frequency cutoffs for abundance estimation. We show how different parameter settings can lead to different results for our test data sets, and illustrate the effects of virus lineage composition of wastewater samples and references.Conclusions Here, we compare a mutation- and sequence-based reference construction and assignment for SARS-CoV-2 abundance estimation from wastewater samples. Our study highlights current computational challenges, focusing on the general reference design, which significantly and directly impacts abundance allocations. We illustrate advantages and disadvantages that may be relevant for further developments in the wastewater community and in the context of higher standardization.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae051), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:
  
  **Reviewer 1: Irene Bassano **
  
  In the manuscript "Impact of reference design on estimating SARS-CoV-2 lineage abundances from wastewater sequencing data" Aßmann et. al compare two methods, a sequence and mutation-based, respectively, to better understand the circulating lineages and sub-lineages in wastewater samples. Since the advent of wastewater-based epidemiology (WBE) as a tool to complement results from clinical data, there has been search for novel tools that can give robustness to the results and more importantly confidence in the data analysis. In this context, this manuscript is very important as it is contributing towards achieving that goal. This is clear in the fact that they have designed a new tool, namely MAMUSS. 1. One aspect however that the manuscript fails to mention is the difficulty in reconstructing full genome sequences from wastewater data. This has been one of the biggest problems since it is widely accepted that viral particles in water do degrade, and consequently what is being sequenced is a partial genome. Consensus sequences are therefore very difficult to obtain. 2. Another aspect that the authors fail to mention in the introduction or as a point of discussion, is how a variant is defined and how we take this information from clinical samples to then adopt it to define variants in environmental samples, although some relevant tools are mentioned such as COJAC and MMMVI. Yet, how these are used, it is not explained. 3. The manuscript is well written, there are some repetitive sentences that need to be removed (see comments on PDF) as well as a couple of sentences which are not grammatically correct (see comments on PDF). 4. It is worth mentioning that the words "variants" and "lineages" are used interchangeably. I do suggest they choose one term only. 5. The manuscript mentions several times the presence of false and true positive, however does not mention how these were calculated. These need to be supported by a small statistical test. 6. There are minor corrections throughout the manuscript that need to be address. All these are highlighted as comments in the original manuscript.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.06.02.543047v1
www.biorxiv.org www.biorxiv.org

Kinship analysis and pedigree reconstruction by RAD sequencing in cattle

2
1. GigaScience 10 Aug 2024
  
  in GigaByte
  
  Editors Assessment:
  
  RAD-Seq (Restriction-site-associated DNA sequencing) is a cost-effective method for single nucleotide polymorphism (SNP) discovery and genotyping. In this study the authors performed a kinship analysis and pedigree reconstruction for two different cattle breeds (Angus and Xiangxi yellow cattle). A total of 975 cattle, including 923 offspring with 24 known sires and 28 known dams, were sampled and subjected to SNP discovery and genotyping using RAD-Seq. Producing a SNP panel with 7305 SNPs capturing the maximum difference between paternal and maternal genome information, and being able to distinguish between the F1 and F2 generation with 90% accuracy. Peer review helped highlight better the practical applications of this work. The combination of the efficiency of RNA-seq and advances in kinship analysis here can helpfully help improve breed management, local resource utilization, and conservation of livestock.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 10 Aug 2024
  
  in GigaByte
  
  AbstractKinship and pedigree information, used for estimating inbreeding, heritability, selection, and gene flow, is useful for breeding and animal conservation. However, as the size of the crossbred population increases, inaccurate generation and parentage recoding in livestock farms increases. Restriction-site-associated DNA sequencing (RAD-Seq) is a cost-effective platform for single nucleotide polymorphism (SNP) discovery and genotyping. Here, we performed a kinship analysis and pedigree reconstruction for Angus and Xiangxi yellow cattle, which benefit from good meat quality and yields, providing a basis for livestock management. A total of 975 cattle, including 923 offspring with 24 known sires and 28 known dams, were sampled and subjected to SNP discovery and genotyping. The identified SNPs panel included 7305 SNPs capturing the maximum difference between paternal and maternal genome information allowing us to distinguish between the F1 and F2 generation with 90% accuracy. In addition, parentage assignment software based on different strategies verified that the cross-assignments. In conclusion, we provided a low-cost and efficient SNP panel for kinship analyses and the improvement of local genetic resources, which are valuable for breed improvement, local resource utilization, and conservation.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.131), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1. Liyun wan
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  The detailed parameters for the SNP and InDel calling should be described to allow reproduction.
  
  Additional Comments:
  
  This research provides valuable insights into the use of RAD-Seq to kinship analysis and pedigree reconstruction, which is useful for breeding and animal conservation purposes. Overall, the study is well-conducted and the findings are relevant. However, there are a few aspects that require attention before the manuscript can be considered for publication. Please address the following points: 1. Provide practical applications: Highlight the practical applications of your research in livestock management, breed improvement, local resource utilization, and conservation. Discuss how the low-cost and efficient SNP panel can contribute to these areas and provide suggestions for further research or implementation. 2. Language and clarity: Review the manuscript for clarity, grammar, and sentence structure. Ensure that all key terms and concepts are defined and explained to facilitate understanding for a broad readership. Once these revisions have been made, I believe the manuscript will be much stronger and suitable for publication.
  
  Reviewer 2. Mohammad Bagher Zandi
  
  Is the language of sufficient quality?
  
  Yes. It was great.
  
  Are all data available and do they match the descriptions in the paper?
  
  Yes. The raw sequencing reads were deposited but it would be better to share the the SNPs data as well.
  
  Is the data acquisition clear, complete and methodologically sound?
  
  No. SNPs detection and SNPs selection for assignment test is not clear.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  No. In some cases, the materials and methods section is vague. It is better to correct them. It is mentioned in the attached manuscript text.
  
  Additional Comments: Well done research, but the manuscript need some correction as commented on the attached file. See: https://gigabyte-review.rivervalleytechnologies.comdownload-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvNTA1L2dpZ2EtY29tZW50cy5kb2N4
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.05.20.595066v1
www.biorxiv.org www.biorxiv.org

Chromosomal-level genome assembly and single-nucleotide polymorphism sites of black-faced spoonbill Platalea minor

2
1. GigaScience 10 Aug 2024
  
  in GigaByte
  
  Editors Assessment: This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong (see https://doi.org/10.46471/GIGABYTE_SERIES_0006). This example assembles the genome of the black-faced spoonbill (Platalea minor), an emblematic wading bird from East Asia that is classified as globally endangered by the IUCN. This Data Release reporting a 1.24Gb chromosomal-level genome assembly produced using a combination of PacBio SMRT and Omni-C scaffolding technologies. BUSCO and Merqury validation were carried out, gene models created, and peer reviewers also requested MCscan synteny analysis. This showed the genome assembly had high sequence continuity with scaffold length N50=53 Mb. Presenting data from 14 individuals this will hopefully be a useful and valuable resources for future population genomic studies aimed at better understanding spoonbill species numbers and conservation.
  
  *This evaluation refers to version 1 of the preprint *
  
  Summary
2. GigaScience 10 Aug 2024
  
  in GigaByte
  
  AbstractPlatalea minor, the black-faced spoonbill (Threskiornithidae) is a wading bird that is confined to coastal areas in East Asia. Due to habitat destruction, it has been classified by The International Union for Conservation of Nature (IUCN) as globally endangered species. Nevertheless, the lack of its genomic resources hinders our understanding of their biology, diversity, as well as carrying out conservation measures based on genetic information or markers. Here, we report the first chromosomal-level genome assembly of P. minor using a combination of PacBio SMRT and Omni-C scaffolding technologies. The assembled genome (1.24 Gb) contains 95.33% of the sequences anchored to 31 pseudomolecules. The genome assembly also has high sequence continuity with scaffold length N50 = 53 Mb. A total of 18,780 protein-coding genes were predicted, and high BUSCO score completeness (93.7% of BUSCO metazoa_odb10 genes) was also revealed. A total of 6,155,417 bi-allelic SNPs were also revealed from 13 P. minor individuals, accounting for ∼5% of the genome. The resource generated in this study offers the new opportunity for studying the black-faced spoonbill, as well as carrying out conservation measures of this ecologically important spoonbill species.
  
  This work is part of a series of papers presenting outputs of the Hong Kong Biodiversity Genomics https://doi.org/10.46471/GIGABYTE_SERIES_0006 This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.130), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1. Richard Flamio Jr.
  
  Is the language of sufficient quality?
  
  No. There are some grammatical errors and spelling mistakes throughout the text.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  Yes. The authors did a phenomenal job at detailing the methods and data-processing steps.
  
  Additional Comments:
  
  Very nice job on the paper. The methods are sound and the statistics regarding the genome assembly are thorough. My only two comments are: 1) I think the paper could be improved by the correction of grammatical errors, and 2) I am interested in a discussion about the number of chromosomes expected for this species (or an estimate) based on related species and if the authors believe all of the chromosomes were identified. For example, is the karyotype known or can the researchers making any inferences about the number of microchromosomes in the assembly? Please see a recent paper I wrote on microchromosomes in the wood stork assembly (https://doi.org/10.1093/jhered/esad077) for some ideas in defining the chromosome architecture of the spoonbill and/or comparing this architecture to related species.
  
  Re-review:
  
  The authors incorporated the revisions nicely and have produced a quality manuscript. Well done.
  
  Minor revisions Line 46: A comma is needed after (Threskiornithidae). Line 47: “The” should not be capitalized. Line 48: This should read “as a globally endangered species.” Line 49: “However, the lack of genomic resources for the species hinders the understanding of its biology…” Line 56: Consider changing “also revealed” to “identified” to avoid repetition from the previous sentence. Line 65: Insert “the” before “bird’s.” Lines 69-70: Move “locally” higher in the sentence – “and it is protected locally…” Line 72: Replace “as of to date” with “prior to this study”. Lines 78-79: Pluralize “part.” Line 86: Replace “proceeded” with “processed.” Line 133: “…are listed in Table 1.” Line 158: “accounted” Line 159: “Variant calling was performed using…” Line 161: “Hard filtering was employed…” Lines 200-201: “The heterozygosity levels… from five individuals were comparable to previous reports on spoonbills – black-faced spoonbill … and royal spoonbill … (Li et al. 2022).” Line 202: New sentence. “The remaining heterozygosity levels observed…” Line 206: “…genetic bottleneck in the black-faced spoonbill…” Lines 208-209: “These results highlight the need…” Lines 213-214: “…which are useful and precious resources for future population genomic studies aimed at better understanding spoonbill species numbers and conservation.” Line 226: Missing a period after “heterozygosity.” For references, consider adding DOIs. Some citations have them but most citations would benefit from this addition.
  
  Reviewer 2. Phred Benham
  
  Is the language of sufficient quality?
  
  Generally yes, the language is sufficiently clear. However, a number of places could be refined and extra words removed.
  
  Are all data available and do they match the descriptions in the paper?
  
  Additional data is available on figshare.
  
  I do not see any of the tables that are cited in the manuscript and contain legends. Am I missing something. Also there is no legend for the GenomeScope profile in figure 3.
  
  The assembly appears to be on genbank as a scaffold level assembly, can you list this accession info in the data availability section in addition to the project number.
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  Overall fine, but some additional analyses would aid the paper. Comparison of the spoonbill genome to other close relatives using a synteny plot would be helpful.
  
  It would also be useful to put heterozygosity and inbreeding coefficients into context by comparing to results from other species.
  
  Additional Comments:
  
  Hui et al. report a chromosome level genome for the black-faced spoonbill, a endangered species of coastal wetlands in East Asia. This genome will serve as an important genome for understanding the biology of and conserving this species.
  
  Generally, the methods are sound and appropriate for the generation of genomic sequence.
  
  Major comments: This is a highly contiguous genome in line with metrics for Vertebrate Genomics Project genomes and other consortia. The authors argue that they have assembled 31 Pseudo-molecules or chromosomes. It would be nice to see a plot showing synteny of these 31 chromosomes and a closely related species with a chromosome level assembly (e.g. Theristicus caerulescens; GCA_020745775.1)
  
  The tables appear to be missing from the submitted manuscript?
  
  Minor comments: Line 49: delete its
  
  Line 49-51: This sentence is a little awkward, please revise.
  
  Line 64: delete 'the'
  
  Line 67: replace 'with' with 'the spoonbil as a'
  
  Line 68: delete 'Interestingly'
  
  Line 70: can you be more specific about what kind of genetic methods had previously been performed?
  
  Line 79: can you provide any additional details on the necessary permits and/or institutional approval
  
  Line 78: what kind of tissue? or were these blood samples?
  
  Line 110: do you mean movies?
  
  Line 143: replace data with dataset
  
  Line 163: it may be worth applying some additional filters in vcftools, e.g. minor allele freq., min depth, max depth, what level of missing data was allowed?, etc.
  
  Line 171: delete 'resulted in'
  
  Line 172: do you mean scaffold L50 was 8? Line 191-195: some context would be useful here, how does this level of heterozygosity and inbreeding compare to other waterbirds?
  
  Line 217: why did you use the Metazoan database and not the Aves_odb10 database for Busco?
  
  Figure 1b: Number refers to what, scaffolds? Be consistent with capitalization for Mb. It seems like the order of scaffold N50 and L50 were reversed.
  
  Figure 3 is missing a legend. Hui et al. report a chromosome level genome for the black-faced spoonbill, a endangered species of coastal wetlands in East Asia. This genome will serve as an important genome for understanding the biology of and conserving this species.
  
  Generally, the methods are sound and appropriate for the generation of genomic sequence.
  
  Major comments: This is a highly contiguous genome in line with metrics for Vertebrate Genomics Project genomes and other consortia. The authors argue that they have assembled 31 Pseudo-molecules or chromosomes. It would be nice to see a plot showing synteny of these 31 chromosomes and a closely related species with a chromosome level assembly (e.g. Theristicus caerulescens; GCA_020745775.1)
  
  The tables appear to be missing from the submitted manuscript?
  
  Minor comments: Line 49: delete its
  
  Line 49-51: This sentence is a little awkward, please revise.
  
  Line 64: delete 'the'
  
  Line 67: replace 'with' with 'the spoonbil as a'
  
  Line 68: delete 'Interestingly'
  
  Line 70: can you be more specific about what kind of genetic methods had previously been performed?
  
  Line 79: can you provide any additional details on the necessary permits and/or institutional approval
  
  Line 78: what kind of tissue? or were these blood samples?
  
  Line 110: do you mean movies?
  
  Line 143: replace data with dataset
  
  Line 163: it may be worth applying some additional filters in vcftools, e.g. minor allele freq., min depth, max depth, what level of missing data was allowed?, etc.
  
  Line 171: delete 'resulted in'
  
  Line 172: do you mean scaffold L50 was 8? Line 191-195: some context would be useful here, how does this level of heterozygosity and inbreeding compare to other waterbirds?
  
  Line 217: why did you use the Metazoan database and not the Aves_odb10 database for Busco?
  
  Figure 1b: Number refers to what, scaffolds? Be consistent with capitalization for Mb. It seems like the order of scaffold N50 and L50 were reversed.
  
  Figure 3 is missing a legend. Re-review:
  
  I previously reviewed this manuscript and overall the authors have done a nice job addressing all of my comments.
  
  I appreciate that the authors include the MCscan analysis that I suggested. However, the alignment of the P. minor assembly and annotations to other genomes suggests rampant mis-assembly or translocations. Birds have fairly high synteny and I would expect Pmin to look more similar to the comparison between T. caerulescens and M. americana in the MCscan plot. For instance, parts of the largest scaffold in the Pmin assembly map to multiple different chromosomes in the Tcae assembly. Similarly, the Z in Tcae maps to 11 different scaffolds in the Pmin assembly and there does not appear to be a single large scaffold in the Pmin assembly that corresponds to the Z chromosome.
  
  The genome seems to be otherwise of strong quality, so I urge the authors to double-check their MCscan synteny analysis. If this pattern remains, can you please add some comments about it to the end of the Data Validation and Quality Control section? I think other readers will also be surprised at the low levels of synteny apparent between the spoonbill and ibis assemblies.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.04.08.588650v1
Jul 2024
www.biorxiv.org www.biorxiv.org

Unveiling Vertebrate Development Dynamics in Frog Xenopus laevis using Micro-CT Imaging

4
1. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Background Xenopus laevis, the African clawed frog, is a versatile vertebrate model organism employed across various biological disciplines, prominently in developmental biology to elucidate the intricate processes underpinning body plan reorganization during metamorphosis. Despite its widespread utility, a notable gap exists in the availability of comprehensive datasets encompassing Xenopus’ late developmental stages.Findings In the present study, we harnessed micro-computed tomography (micro-CT), a non-invasive 3D imaging technique utilizing X-rays to examine structures at a micrometer scale, to investigate the developmental dynamics and morphological changes of this crucial vertebrate model. Our approach involved generating high-resolution images and computed 3D models of developing Xenopus specimens, spanning from premetamorphosis tadpoles to fully mature adult frogs. This extensive dataset enhances our understanding of vertebrate development and is adaptable for various analyses. For instance, we conducted a thorough examination, analyzing body size, shape, and morphological features, with a specific emphasis on skeletogenesis, teeth, and organs like the brain at different stages. Our analysis yielded valuable insights into the morphological changes and structure dynamics in 3D space during Xenopus’ development, some of which were not previously documented in such meticulous detail. This implies that our datasets effectively capture and thoroughly examine Xenopus specimens. Thus, these datasets hold the solid potential for additional morphological and morphometric analyses, including individual segmentation of both hard and soft tissue elements within Xenopus.Conclusions Our repository of micro-CT scans represents a significant resource that can enhance our understanding of Xenopus’ development and the associated morphological changes. The widespread utility of this amphibian species, coupled with the exceptional quality of our scans, which encompass a comprehensive series of developmental stages, opens up extensive opportunities for their broader research application. Moreover, these scans have the potential for use in virtual reality, 3D printing, and educational contexts, further expanding their value and impact.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Virgilio Gail Ponferrada (R1)
  
  Thanks to the authors for accommodating the reviewers' suggestions. The manuscript continues to be well constructed and easy to read. I appreciate the addition of micro-CT analysis of Xenopus gut development and the inclusion of scans of additional samples for statistical analysis bolstering their findings. Should the manuscript be accepted for publication, perhaps the authors will contact Xenbase (www.xenbase.org), the Xenopus research database, as an additional means of featuring their micro-CT datasets. I suggest this manuscript be accepted for publication.
2. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Background Xenopus laevis, the African clawed frog, is a versatile vertebrate model organism employed across various biological disciplines, prominently in developmental biology to elucidate the intricate processes underpinning body plan reorganization during metamorphosis. Despite its widespread utility, a notable gap exists in the availability of comprehensive datasets encompassing Xenopus’ late developmental stages.Findings In the present study, we harnessed micro-computed tomography (micro-CT), a non-invasive 3D imaging technique utilizing X-rays to examine structures at a micrometer scale, to investigate the developmental dynamics and morphological changes of this crucial vertebrate model. Our approach involved generating high-resolution images and computed 3D models of developing Xenopus specimens, spanning from premetamorphosis tadpoles to fully mature adult frogs. This extensive dataset enhances our understanding of vertebrate development and is adaptable for various analyses. For instance, we conducted a thorough examination, analyzing body size, shape, and morphological features, with a specific emphasis on skeletogenesis, teeth, and organs like the brain at different stages. Our analysis yielded valuable insights into the morphological changes and structure dynamics in 3D space during Xenopus’ development, some of which were not previously documented in such meticulous detail. This implies that our datasets effectively capture and thoroughly examine Xenopus specimens. Thus, these datasets hold the solid potential for additional morphological and morphometric analyses, including individual segmentation of both hard and soft tissue elements within Xenopus.Conclusions Our repository of micro-CT scans represents a significant resource that can enhance our understanding of Xenopus’ development and the associated morphological changes. The widespread utility of this amphibian species, coupled with the exceptional quality of our scans, which encompass a comprehensive series of developmental stages, opens up extensive opportunities for their broader research application. Moreover, these scans have the potential for use in virtual reality, 3D printing, and educational contexts, further expanding their value and impact.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: John Wallingford (Original submission)
  
  Laznovsky et al. present a nice compendium of micro-CT-based digital volumes of several stages of Xenopus development. Given the prominence of this important model animal in studies of developmental biology and physiology, this dataset is quite useful and will be of interest to the community. That said, the study has some key limitations that will limit its utility for the research community, though these do not reduce the dataset's impact in the education and popular science realms, which is also a stated goal for the paper. Overall, I recommend publication after an effort has been made to address the following concerns.
  
  The atlas adequately samples developmental stages from late tadpole through metamorphosis. However, as far as I can tell only a single sample has been imaged at each stage. Thus, the quantifications of inter-stage differences shown here (Fig. 2, 4, 5) are at best very rough estimates and also provide no information about intra-stage variability in these metrics. This is not a fatal weakness, but it is an important caveat that I believe should be very explicitly stated in the text and in the figure legend of relevant figures.
  
  I am very disappointed that the rich history of microCT on Xenopus seems to have been entirely ignored by these authors. MicroCT has already been used to describe the skull, the brain, liver, blood vessels, etc. during Xenopus development. (Just a few papers the authors should read are: Slater et al., PLoS One 2009; Senevirathnea et al., PNAS, 2019; Ishii et al., Dev. Growth, Diff. 2023; Zhu et al., Front. Zool 2020.) It has also been used for comparative studies of other frogs (Kondo et al., Dev. Growth, Diff. 2022; Kraus, Anat. Rec. 2021; Jandausch et al., Zool. Anz. 2022; Paluh, et al., Evolution 2021, Paluh et al., eLife 2021). None of these -or the many other relevant papers- are discussed or cited here. The research community would be much better served if authors make a serious effort to integrate their methods and their results into this existing literature.
  
  An opportunity may have been missed here to provide some truly new biological insights: The gut remodels substantially during metamorphosis, but to my knowledge that has NOT be previously examined by microCT. It may not work, as the gut may simply be too soft to visualize, but then again, it may be worth trying.
3. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Background Xenopus laevis, the African clawed frog, is a versatile vertebrate model organism employed across various biological disciplines, prominently in developmental biology to elucidate the intricate processes underpinning body plan reorganization during metamorphosis. Despite its widespread utility, a notable gap exists in the availability of comprehensive datasets encompassing Xenopus’ late developmental stages.Findings In the present study, we harnessed micro-computed tomography (micro-CT), a non-invasive 3D imaging technique utilizing X-rays to examine structures at a micrometer scale, to investigate the developmental dynamics and morphological changes of this crucial vertebrate model. Our approach involved generating high-resolution images and computed 3D models of developing Xenopus specimens, spanning from premetamorphosis tadpoles to fully mature adult frogs. This extensive dataset enhances our understanding of vertebrate development and is adaptable for various analyses. For instance, we conducted a thorough examination, analyzing body size, shape, and morphological features, with a specific emphasis on skeletogenesis, teeth, and organs like the brain at different stages. Our analysis yielded valuable insights into the morphological changes and structure dynamics in 3D space during Xenopus’ development, some of which were not previously documented in such meticulous detail. This implies that our datasets effectively capture and thoroughly examine Xenopus specimens. Thus, these datasets hold the solid potential for additional morphological and morphometric analyses, including individual segmentation of both hard and soft tissue elements within Xenopus.Conclusions Our repository of micro-CT scans represents a significant resource that can enhance our understanding of Xenopus’ development and the associated morphological changes. The widespread utility of this amphibian species, coupled with the exceptional quality of our scans, which encompass a comprehensive series of developmental stages, opens up extensive opportunities for their broader research application. Moreover, these scans have the potential for use in virtual reality, 3D printing, and educational contexts, further expanding their value and impact.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Virgilio Gail Ponferrada (Original submission)
  
  The manuscript is well written and easy to understand. It will be a good contribution to the Xenopus research community as well as a useful reference for the field of developmental and amphibian biology.
  
  I suggest the following revisions: - For the graphical abstract try alternating NF stage numbers above and below samples for a cleaner look, adult male and adult female can both remain at the top. - Appreciate the rationale for providing the microCT analysis presented in this manuscript and choices of late stage tadpoles, pre- and prometamorphosis, through metamporphosis to the adult male and female frog. - For the head development section authors can make reference to the Xenhead drawings, Zahn et al. Development 2017. - Head Development section paragraph 4, change word from "gender" to "sex." - Supplementary Table 3. Change "gender-related" to "sex-related." - Micro-CT Data Analysis of Long Bone Growth Dynamics section paragraph 1 change "in terms of gender" to "in terms of sex." - Figure 4 panels A and B don't reflect the observation that adult females are enlarged males. While the authors state that the view of the male and female skeletons are maximized and not proportional as stated in the caption, suggest that scale bars be employed and the images adjusted to show the size relationship difference between the sexes as in Figure 1. On first glance and perhaps to those not as familiar with the difference in sex size in Xenopus that in this particular example of the adult male image being more spread out compared to the image of the female, it feels misleading. - Ossification Analysis section paragraph 2 change "frog's gender" to "frog's sex." - Figure 5 panel A, the label is overlapping "NF 59." For panels B and B' scale bars on these panels would help the reader understand the proportions. Yes, there is the 3mm scale bar from panel A and as stated in the caption, but including them in the B panels could help even if panel B had a scale bar labeled at 0.25 mm and panel B' was 3 mm. - Segmentation of Selected Internal Soft Organ section, perhaps more commentary on the ability to observe the development of the segmentation of the brain regions: cbh: cerebral hemispheres; cbl: cerebellum; dch: diencephalon; mob: medulla oblongata; opl: optic lobes; sp: spinal cord while clearly shown in Figure 6, some accompanying description in the text would help readers in general or give the implication that microCT analysis of mutant or diseased frogs could help identify physical characteristics of frogs with developmental or neurological disorders. This would help transition from the analysis of a specific organ to the next section Further Biological Potential of Xenopus's Data. - These analyses, while thorough accompanied by novel visuals, require statistical implementation of multiple tadpoles and frogs per NF stage to account for variation in samples and to bolster the claims stated in skull thickness, the head mass and eye distance changes, increased length of the long bones during maturation, and femural ossification cartilage to bone ratios. This may constitute a suggested major revision to perform these analyses.
4. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Background Xenopus laevis, the African clawed frog, is a versatile vertebrate model organism employed across various biological disciplines, prominently in developmental biology to elucidate the intricate processes underpinning body plan reorganization during metamorphosis. Despite its widespread utility, a notable gap exists in the availability of comprehensive datasets encompassing Xenopus’ late developmental stages.Findings In the present study, we harnessed micro-computed tomography (micro-CT), a non-invasive 3D imaging technique utilizing X-rays to examine structures at a micrometer scale, to investigate the developmental dynamics and morphological changes of this crucial vertebrate model. Our approach involved generating high-resolution images and computed 3D models of developing Xenopus specimens, spanning from premetamorphosis tadpoles to fully mature adult frogs. This extensive dataset enhances our understanding of vertebrate development and is adaptable for various analyses. For instance, we conducted a thorough examination, analyzing body size, shape, and morphological features, with a specific emphasis on skeletogenesis, teeth, and organs like the brain at different stages. Our analysis yielded valuable insights into the morphological changes and structure dynamics in 3D space during Xenopus’ development, some of which were not previously documented in such meticulous detail. This implies that our datasets effectively capture and thoroughly examine Xenopus specimens. Thus, these datasets hold the solid potential for additional morphological and morphometric analyses, including individual segmentation of both hard and soft tissue elements within Xenopus.Conclusions Our repository of micro-CT scans represents a significant resource that can enhance our understanding of Xenopus’ development and the associated morphological changes. The widespread utility of this amphibian species, coupled with the exceptional quality of our scans, which encompass a comprehensive series of developmental stages, opens up extensive opportunities for their broader research application. Moreover, these scans have the potential for use in virtual reality, 3D printing, and educational contexts, further expanding their value and impact.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Brian Metscher (Original submission)
  
  The authors present a set of 3D images of selected developmental stages of the widely-used laboratory model Xenopus laevis along with some examples of how the data might be used in developmental analyses. The dataset covers stages from mid-larva through metamorphosis to adult, which should provide a starting point for various studies of morphological development. Some studies will undoubtedly require other stages or more detailed images, but the presented data were collected with straightforward methods that will allow compatibility with future work.
  
  The data appear to be sound in the collection and curation. Data availability is made clear in the article, and the complete set will be publicly available in standard formats on the Zenodo repository. This should ensure full accessibility to anyone interested. The article is well-organized and clearly written.
  
  A few points about the methods could be clarified: Was only one specimen per stage scanned? Specimens were dehydrated through an ethanol series and then stained with free iodine in 90% methanol, and then rehydrated back through ethanol. Why was methanol used for the staining and not dehydration? It seems odd to switch alcohols back and forth without intermediate steps. This could have some effect on tissue shrinkage. It should be indicated that the X-ray source target is tungsten (even though it is unlikely to be anything else in this machine). The "real images" (p. 7) in Suppl. Fig. 1 should simply be called photographs - microCT images are real too. For the measurements of bone mass, is the cartilage itself actually visible in the microCT images? p. 13: "The dataset's diverse species representation…" What does this mean? It is only one species. The limitations on the image data are not discussed. All images have limits to their useful resolution and contrast among components; this is not a weakness, just a reality of imaging. The different reconstructed voxel sizes for different size specimens are mentioned, but it might be helpful to indicate the voxel sizes in Figure 1 as well as in the relevant table. And if the middle column of Figure 1 could be published with full resolution of the snapshots it would help show the actual quality of the images.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.06.11.598452v1
www.biorxiv.org www.biorxiv.org

gNOMO2: a comprehensive and modular pipeline for integrated multi-omics analyses of microbiomes

3
1. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Background Over the past few years, the rise of omics technologies has offered an exceptional chance to gain a deeper insight into the structural and functional characteristics of microbial communities. As a result, there is a growing demand for user friendly, reproducible, and versatile bioinformatic tools that can effectively harness multi-omics data to offer a holistic understanding of microbiomes. Previously, we introduced gNOMO, a bioinformatic pipeline specifically tailored to analyze microbiome multi-omics data in an integrative manner. In response to the evolving demands within the microbiome field and the growing necessity for integrated multi-omics data analysis, we have implemented substantial enhancements to the gNOMO pipeline.Results Here, we present gNOMO2, a comprehensive and modular pipeline that can seamlessly manage various omics combinations, ranging from two to four distinct omics data types including 16S rRNA gene amplicon sequencing, metagenomics, metatranscriptomics, and metaproteomics. Furthermore, gNOMO2 features a specialized module for processing 16S rRNA gene amplicon sequencing data to create a protein database suitable for metaproteomics investigations. Moreover, it incorporates new differential abundance, integration and visualization approaches, all aimed at providing a more comprehensive toolkit and insightful analysis of microbiomes. The functionality of these new features is showcased through the use of four microbiome multi-omics datasets encompassing various ecosystems and omics combinations. gNOMO2 not only replicated most of the primary findings from these studies but also offered further valuable perspectives.Conclusions gNOMO2 enables the thorough integration of taxonomic and functional analyses in microbiome multi-omics data, opening up avenues for novel insights in the field of both host associated and free-living microbiome research. gNOMO2 is available freely at https://github.com/muzafferarikan/gNOMO2.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Yuan Jiang (R1)
  
  The authors have fully addressed my comments.
2. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Background Over the past few years, the rise of omics technologies has offered an exceptional chance to gain a deeper insight into the structural and functional characteristics of microbial communities. As a result, there is a growing demand for user friendly, reproducible, and versatile bioinformatic tools that can effectively harness multi-omics data to offer a holistic understanding of microbiomes. Previously, we introduced gNOMO, a bioinformatic pipeline specifically tailored to analyze microbiome multi-omics data in an integrative manner. In response to the evolving demands within the microbiome field and the growing necessity for integrated multi-omics data analysis, we have implemented substantial enhancements to the gNOMO pipeline.Results Here, we present gNOMO2, a comprehensive and modular pipeline that can seamlessly manage various omics combinations, ranging from two to four distinct omics data types including 16S rRNA gene amplicon sequencing, metagenomics, metatranscriptomics, and metaproteomics. Furthermore, gNOMO2 features a specialized module for processing 16S rRNA gene amplicon sequencing data to create a protein database suitable for metaproteomics investigations. Moreover, it incorporates new differential abundance, integration and visualization approaches, all aimed at providing a more comprehensive toolkit and insightful analysis of microbiomes. The functionality of these new features is showcased through the use of four microbiome multi-omics datasets encompassing various ecosystems and omics combinations. gNOMO2 not only replicated most of the primary findings from these studies but also offered further valuable perspectives.Conclusions gNOMO2 enables the thorough integration of taxonomic and functional analyses in microbiome multi-omics data, opening up avenues for novel insights in the field of both host associated and free-living microbiome research. gNOMO2 is available freely at https://github.com/muzafferarikan/gNOMO2.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Yuan Jiang (original submission)
  
  Referee Report for "gNOMO2: a comprehensive and modular pipeline for integrated multi-omics analyses of microbiomes"
  
  This paper introduced gNOMO2, a new version of gNOMO, which is a bioinformatic pipeline for multiomic management and analysis of microbiomes. The authors claimed that gNOMO2 incorporates new differential abundance, integration, and visualization tools compared to gNOMO. However, these new features as well as the distinction between gNOMO2 and gNOMO has not been clearly presented in the paper. In addition, the Methods section is written as a pipeline of bioinformatic tools and it is not clear what these tools are used for unless one is familiar with all the bioinformatic tools.
  
  My major comments are as follows:
  
  Given the existing work on gNOMO, it is critical for the authors to distinguish gNOMO2 from gNOMO to show its novelty. In the Methods section, the authors presented the six modules of gNOMO2. Are these all new from gNOMO, or does gNOMO included some of these functions? A clearer presentation of gNOMO2 versus gNOMO is needed.
  
  The authors did not present the methods in each module very well. For example, the authors wrote in Module 2 that "MaAsLin2 [31] is employed to determine differentially abundant taxa based on both AS and MP data. Furthermore, a joint visualization of MP and AS results is performed using the combi R package [32]. The final outputs include AS and MP based abundance tables, results from differential abundance analysis, and joint visualization analysis results." Without reading the references 31 and 32, it is very hard to understand what this module is really doing.
  
  The authors used the term "integrated multi-omics analysis" in all six modules of gNOMO2. It is not clear how this terms really means. It reads like that it is not really integrated analysis, instead, it is more like a module that can handle different types of data separately, such as differential abundance analysis for each type. What other integration has been used except joint visualization? What new integration tools have been incorporated in gNOMO2?
  
  In the differential abundance analysis, does the pipeline consider the features of microbiome data, such as their count, sparsity, and compositional features? Can the modules incorporate covariates in their differential abundance analysis? It is quite useful to have covariates adjusted in a differential abundance analysis?
  
  In the Analyses section, the authors applied gNOMO2 to re-analyze samples from previously published studies. They found some discrepancy between their results and the ones in the literature. Although some discrepancy is normal, the authors need to explain better what causes the discrepancy and whether it could yield different biological conclusions.
3. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Background Over the past few years, the rise of omics technologies has offered an exceptional chance to gain a deeper insight into the structural and functional characteristics of microbial communities. As a result, there is a growing demand for user friendly, reproducible, and versatile bioinformatic tools that can effectively harness multi-omics data to offer a holistic understanding of microbiomes. Previously, we introduced gNOMO, a bioinformatic pipeline specifically tailored to analyze microbiome multi-omics data in an integrative manner. In response to the evolving demands within the microbiome field and the growing necessity for integrated multi-omics data analysis, we have implemented substantial enhancements to the gNOMO pipeline.Results Here, we present gNOMO2, a comprehensive and modular pipeline that can seamlessly manage various omics combinations, ranging from two to four distinct omics data types including 16S rRNA gene amplicon sequencing, metagenomics, metatranscriptomics, and metaproteomics. Furthermore, gNOMO2 features a specialized module for processing 16S rRNA gene amplicon sequencing data to create a protein database suitable for metaproteomics investigations. Moreover, it incorporates new differential abundance, integration and visualization approaches, all aimed at providing a more comprehensive toolkit and insightful analysis of microbiomes. The functionality of these new features is showcased through the use of four microbiome multi-omics datasets encompassing various ecosystems and omics combinations. gNOMO2 not only replicated most of the primary findings from these studies but also offered further valuable perspectives.Conclusions gNOMO2 enables the thorough integration of taxonomic and functional analyses in microbiome multi-omics data, opening up avenues for novel insights in the field of both host associated and free-living microbiome research. gNOMO2 is available freely at https://github.com/muzafferarikan/gNOMO2.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Alexander Bartholomaus (original submission)
  
  Summary: "gNOMO2: a comprehensive and modular pipeline for integrated multi-omics analyses of microbiomes" by ArÄ±kan and Muth presents a multi-omics tools for analysis of prokaryotes. It is an evolution of the first version and offers various separate modules, taking different type of input data. They present different example analysis based on already published data and reproduced the results. The manuscript is very well written (I could not detect a single typo) and it was fun to read! Well done! I have only very few comments and suggestions, see below. However, I had a problem executing the code.
  
  Key questions to answer: 1) Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? Yes 2) Are the conclusions adequately supported by the data shown? Yes 3) Please indicate the quality of language in the manuscript. Does it require a heavy editing for language and clarity? Very well written! 4) Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used? No direct statistics given in the manuscript. Maybe the authors could include some example output as .zip file for interested potential users.
  
  Detailed comments to the manuscript: Line 168: What does "cleaned and redundancies are removed" mean? Are only identical genomes removed? Or are genome part that are identical (I guess this barely exists, except for conserved gene parts as 16S, or similar) removed? Or are only redundant genes removed? How is redundancy defined, 99% identical stretch? Line 399-405: When looking at figure 5A I am wondering how Fluviicoccus and Methanosarcina in the MP faction appear relatively abundant in some samples. Where they de novo assembled in the MG or MT modules? General comment figures: I know that it is a hack to deal with automatic figure generation and especially the axis labels (as names have very different length). However, I think some figures might be hardly visable in the printed version, especially axes label for panel B are very small. Maybe you can put the critical figures separately in the supplement, e.g. each B panel a one page.
  
  Suggestions: As suggest above, maybe the authors could include some example output (a simple example) as .zip file for interested potential users. This would give an idea of how the output looks like and what to expect besides the plots. But differential abundance tables might be more important than the plots, as the user would generate their own plot for later publications.
  
  Github and software: I also tested the software and followed the instructions in the Github. I successfully executed the "Requirements" and "Config" steps (including create of metadata file and copying of amplicon reads) and tried to execute Modul1.
  
  However, the following error occurred (using up-to-date conda and snakemake on Ubuntu linux): (snakemake) abartho@gmbs17:~/review_papers/GigaScience/gNOMO2$ snakemake -v 6.15.5 (snakemake) abartho@gmbs17:~/review_papers/GigaScience/gNOMO2$ snakemake -s workflow/Snakefile --cores 20 SyntaxError in line 9 of /home/abartho/miniconda3/envs/snakemake/lib/python3.6/sitepackages/smart_open/s3.py: future feature annotations is not defined (s3.py, line 9) File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/sitepackages/smart_open/init.py", line 34, in <module> File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/sitepackages/smart_open/smart_open_lib.py", line 35, in <module> File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/sitepackages/smart_open/doctools.py", line 21, in <module> File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/sitepackages/smart_open/transport.py", line 104, in <module> File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/sitepackages/smart_open/transport.py", line 49, in register_transport File "/home/abartho/miniconda3/envs/snakemake/lib/python3.6/importlib/init.py", line 126, in import_module In addition to solving the problem, an example metadata file and some explanation about the output (which I did not see yet) would be good for less experienced users.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.12.03.569767v1
www.biorxiv.org www.biorxiv.org

PEPhub: a database, web interface, and API for editing, sharing, and validating biological sample metadata

3
1. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Background As biological data increases, we need additional infrastructure to share it and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important, and in some ways has a wider scope than sharing data itself.Results Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural language search. PEPhub thus provides a fast and user-friendly way to finding existing biological research data, or to share new data.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Weiwen Wang (R1)
  
  The author has addressed most of my concerns, although some issues remain unresolved due to hardware and technical limitations.
2. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Background As biological data increases, we need additional infrastructure to share it and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important, and in some ways has a wider scope than sharing data itself.Results Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural language search. PEPhub thus provides a fast and user-friendly way to finding existing biological research data, or to share new data.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Weiwen Wang (original submission)
  
  This manuscript by LeRoy et al. introduces PEPhub, a database aimed at enhancing the sharing and interoperability of biological metadata using the PEP framework. One of the key highlights of this manuscript is the visualization of the PEP framework, which improves the adoption of the PEP framework, facilitating the reuse of metadata. Additionally, PEPhub integrates data from GEO, making it convenient for users to access and utilize. Furthermore, PEPhub offers metadata validation, allowing users to quickly compare their PEP with other PEPhub schemas. Another notable feature is the natural language search, which further enhances the user experience. Overall, PEPhub provides a comprehensive solution that promotes efficient metadata sharing, while leveraging the impact of the PEP framework in organizing large-scale biological research projects.While this manuscript was interesting to read, I have several concerns regarding its "semantic" search system and the interaction of PEPHub.1.
  
  The authors mentioned their use of a tool called "pepembed" to embed PEP descriptions into vectors. However, I was unable to locate the tool on GitHub, and there is limited information in the Method section regarding this. Could the authors provide additional details regarding the process of embedding vectors?2. The authors implemented semantic search as an advantage of PEPhub. However, they did not evaluate the effectiveness of their natural language search engine, such as assessing accuracy, recall rate, or F1 score. It would be beneficial for the authors to perform an evaluation of their natural language search engine and provide metrics to demonstrate its performance. This would enhance the credibility and reliability of their claims regarding the advantages of natural language search in PEPhub.3. It would be more beneficial to include the metadata in the search system rather than solely relying on the project description. For instance, when I searched for SRX17165287 (https://pephub.databio.org/geo/gse211736?tag=default), no results were returned.4. When creating a new PEP, it appears that I can submit two samples with identical values. According to the PEP framework guidelines, it is mentioned that "Typically, samples should have unique values in the sample table index column". Therefore, the authors should enhance their metadata validation system to enforce this uniqueness constraint. Additionally, if I enter two identical values in the sample field and then attempt to add a SUBSAMPLE, an error occurs. However, when I modify one of the samples, I am able to save it successfully.5. The error messages should provide more specific guidance. Currently, when attempting to save metadata with an incorrect format, all error messages are displayed as: "Unknown error occurred: Unknown".6.
  
  PEPhub should consider providing user guidelines or examples on how to fill in subsample metadata and any relevant rules associated with it.7. In the Validation module, what are the rules for validation? Does it only check for the required column names in the schema, or does it also validate the content of the metadata, such as whether the metadata is in the correct format (e.g., int or string)? Additionally, it would be beneficial to provide an option to download the relevant schema and clearly specify the required column names in the schema. This would enable users to better organize their PEP to comply with the schema format and ensure that their metadata is accurately validated.8. This version of PEPHub primarily focuses on metadata. Have the authors considered any plans to expand this database to include data/pipeline management within the PEP framework? It would be valuable for the authors to discuss their future plans for PEPHub in this manuscript.Some minor concerns:1. When searching for content within a specific namespace, it would be beneficial for the pagination bar at the bottom of the webpage to display the number of pages. Now there are only Previous/Next buttons.2. As a web service, it is better to show the supporting browsers, such as Google Chrome (version xxx and above), Firefox (version xxx and above). I failed to open PEPHub website using an old version of Chrome.
3. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Background As biological data increases, we need additional infrastructure to share it and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important, and in some ways has a wider scope than sharing data itself.Results Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural language search. PEPhub thus provides a fast and user-friendly way to finding existing biological research data, or to share new data.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Jeremy Leipzig (original submission)
  
  Metadata describes the who, what, where, when, and why of an experiment. Sample metadata is arguably the most important of these, but not the only type. LeRoy et al describes a user-centric sample metadata management system with extensibility, support for multiple interface modalities, and fuzzy semantic search.This system and portal, PEPHub, bridges the gaps between LIMS, which are tightly bound to the wet lab, metadata fetchers like GEOfetch (from the same group) or pysradb, and public portals like MetaSRA and the others listed in . Then and both of which don't allow you to roll your own portal internally, and whose search criteria are not fuzzy or semantic.People have been storing metadata in bespoke databases for decades, but not in an interoperable mature fashion. The PepHUB portal builds on some existing Pep standards by the same group, introducing a restful API and GUI.I find this paper a novel and compelling submission but would like the following minor revisions:1. Typically in SRA a sample refers to a dna sample drawn from a tissue sample (ie BioSample) and then runs describe sequencing attempts on those dna samples, and files are produced from each of the runs. It is unclear to me how someone working in an internal lab using PEPHub would know how to extract the file locations of sequence files associated with a sample if these are many-to-one. In the GEO example provided I can click on the SRX link to see the runs and files but how would this work for an internally generated entry? I need the authors to explain this either as a response or in the text.2. I think the paper has to briefly describe how the authors envision how PEPhub should interface with or replaces a LIMS for labs that are producing their own data and describe how it can help accelerate the SRA submission process for these data generating labs.3. Change "Bernasconi2021" to META-BASE in the text4. Some of the search confidence measures show an absurd level of significant digits (e.g.56.99999999999999% Please round that as it's only used for sorting.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.08.15.551388v2
www.biorxiv.org www.biorxiv.org

Omada: Robust clustering of transcriptomes through multiple testing

6
1. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High throughput RNA sequencing is providing large datasets capable of reflecting disease mechanisms. Clustering approaches have produced a number of tools to help dissect complex heterogeneous datasets, however, selecting the appropriate method and parameters to perform exploratory clustering analysis of transcriptomic data requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent. To address this we have developed Omada, a suite of tools aiming to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning based functions. The efficiency of each tool was tested with five datasets characterised by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Casey S. Greene (R2)
  
  The authors describe Omada, which is software to cluster transcriptomic data using multiple methods. The approach selects a heuristically best method from among those tested. The manuscript does describe a software package and there is evidence that the implementation works as described. The manuscript structure was substantially easier for me to follow with the revisions. The manuscript does not have evidence that the method outperforms other potential approaches in this space. It is not clear to me if this is or is not an important consideration for this journal. The form requires that I select from among the options offered. Given that this requires editorial assessment, I have marked "Minor Revision" but I do not feel a minor revision is necessary if, with the present content of the paper, the editor feels it is appropriate. If a revision is deemed necessary, I expect it would need to be a major one.
2. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High throughput RNA sequencing is providing large datasets capable of reflecting disease mechanisms. Clustering approaches have produced a number of tools to help dissect complex heterogeneous datasets, however, selecting the appropriate method and parameters to perform exploratory clustering analysis of transcriptomic data requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent. To address this we have developed Omada, a suite of tools aiming to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning based functions. The efficiency of each tool was tested with five datasets characterised by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Casey S. Greene (R1)
  
  The authors have revised their manuscript. They added benchmarking for the method, which is important. The following overall comments still apply - there is not substantial evidence provided for the selections made:
  
  "I found the manuscript difficult to read. It reads somewhat like a how-to guide and somewhat like a software package. I recommend approaching this as a software package, which would require adding evidence to support the choices made. Describe the purpose for the package, evidence for the choices made, benchmarking (compute and performance), describe application to one or more case studies, and discuss how the work fits into the context.
  
  The evaluation includes two simulation studies and then application to a few real datasets; however, for all real datasets the problem is either very easy or the answer is unknown. The largest challenges I have with the manuscript are the large number of arbitrarily selected parameters the limited evidence available to support those as strong choices.
  
  Conceptually, an alternative strategy is to consider real clusters to be those that are robust over many clustering methods. In this case, the best clusters are those that are maximally detectable with a single method. While there exists software for the former strategy, this package implements the latter strategy. It is not intuitively clear to me that this framework is superior to the other for biological discovery. It seems like general clusters (i.e., those that persist across multiple parameterizations) may be the most fruitful to pursue. It would be helpful to provide evidence that the selected strategy has superior utility in at least some settings and a description of how those settings might be identified." It is possible this is not necessary, but I simply note it as I continue to have these challenges with the revised manuscript.
3. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High throughput RNA sequencing is providing large datasets capable of reflecting disease mechanisms. Clustering approaches have produced a number of tools to help dissect complex heterogeneous datasets, however, selecting the appropriate method and parameters to perform exploratory clustering analysis of transcriptomic data requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent. To address this we have developed Omada, a suite of tools aiming to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning based functions. The efficiency of each tool was tested with five datasets characterised by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Pierre Cauchy (R1)
  
  Kariotis et al. have efficiently addressed most reviewer comments. Omada, the tool presented there will be of interest to the oncology and bioinformatics communities.
4. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High throughput RNA sequencing is providing large datasets capable of reflecting disease mechanisms. Clustering approaches have produced a number of tools to help dissect complex heterogeneous datasets, however, selecting the appropriate method and parameters to perform exploratory clustering analysis of transcriptomic data requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent. To address this we have developed Omada, a suite of tools aiming to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning based functions. The efficiency of each tool was tested with five datasets characterised by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Casey S. Greene (original submission)
  
  The authors describe a system for clustering gene expression data. The manuscript describes clustering workflows (data cleaning, assessing data structure, etc).
  
  I found the manuscript difficult to read. It reads somewhat like a how-to guide and somewhat like a software package. I recommend approaching this as a software package, which would require adding evidence to support the choices made. Describe the purpose for the package, evidence for the choices made, benchmarking (compute and performance), describe application to one or more case studies, and discuss how the work fits into the context.
  
  The evaluation includes two simulation studies and then application to a few real datasets; however, for all real datasets the problem is either very easy or the answer is unknown. The largest challenges I have with the manuscript are the large number of arbitrarily selected parameters the limited evidence available to support those as strong choices. Conceptually, an alternative strategy is to consider real clusters to be those that are robust over many clustering methods. In this case, the best clusters are those that are maximally detectable with a single method. While there exists software for the former strategy, this package implements the latter strategy. It is not intuitively clear to me that this framework is superior to the other for biological discovery. It seems like general clusters (i.e., those that persist across multiple parameterizations) may be the most fruitful to pursue. It would be helpful to provide evidence that the selected strategy has superior utility in at least some settings and a description of how those settings might be identified. I examined the vignette, and I found that it provided a set of examples. I can imagine that running this on larger datasets would be highly time-consuming. It would be helpful to add benchmarking or an estimate of compute time. Given that this seems feasible to parallelize, it might make sense to provide a mechanism for parallelization.
  
  I examined the software briefly. There are some comments. Dead code exists in some files. There is at least one typo in a filename (gene_singatures.R). Some of the choices that seemed arbitrary appear to be written into the software (e.g., get_top30percent_coefficients.R).
5. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High throughput RNA sequencing is providing large datasets capable of reflecting disease mechanisms. Clustering approaches have produced a number of tools to help dissect complex heterogeneous datasets, however, selecting the appropriate method and parameters to perform exploratory clustering analysis of transcriptomic data requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent. To address this we have developed Omada, a suite of tools aiming to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning based functions. The efficiency of each tool was tested with five datasets characterised by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: **Pierre Cauchy **
  
  Kariotis et al present Omada, a tool dedicated to automated partitioning of large-scale, cohort-based RNA-Sequencing data such as TCGA. A great strength for the manuscript is that it clearly shows that Omada is capable of performing partitioning from PanCan into BRCA, COAD and LUAD (Fig 5), and datasets with no known groups (PAH and GUSTO), which is impressive and novel. I would like to praise the authors for coming up with such a tool, as the lack of a systematic tool dedicated to partitioning TCGA-like expression data is indeed a shortcoming in the field of medical genomics Overall, I believe the tool will be very valuable to the scientific community and could potentially contribute to meta-analysis of cohort RNA-Seq data. I only have a few comments regarding the methodology and manuscript. I also think that it should be more clearly stated that Omada is dedicated to large datasets (e.g. TCGA) and not differential expression analysis. I would also suggest benchmarking Omada to comparable tools via ROC curves if possible (see below). Methods: This section should be a bit more homogeneous between text descriptive and mathematical descriptive. It should specify what parts are automated and what part needs user input and refer to the vignette documentation. I also could not find the Omada github repository. Sample and gene expression preprocessing: To me, this section lacks methods/guidelines and only loosely describes the steps involved. "numerical data may need to be normalised in order to account for potential misdirecting quantities" - which kind of normalisation? "As for the number of genes, it is advised for larger genesets (>1000 genes) to filter down to the most variable ones before the application of any function as genes that do not vary across samples do not contribute towards identifying heterogeneity" What filtering is recommended? Top 5% variance? 1%? Based on what metric? Determining clustering potential: To me, it was not clear if this is automatically performed by Omada and how the feasibility score is determined. Intra-method Clustering Agreement: Is this from normalised data? Because affinity matrix will be greatly affected whether it's normalised or non-normalised data as the matrix of exponential(-normalised gene distance)^2 Spectral clustering step 2: "Define D to be the diagonal matrix whose (i, i)-element is the sum of A's i-th row": please also specify that A(i,j) is 0 in this diagonal matrix. Please also confirm which matrix multiplication method is used, product or Cartesian product? Also if there are 0 values, NAs will be obtained in this step. Hierarchical clustering step 5: "Repeat Step 3 a total of n âˆ’ 1 times until there is only one cluster left." This is a valuable addition as this merges identical clusters, the methods should emphasise that the benefits of this clustering reduction method to help partition data, i.e. that this minimises the number of redundant clusters. Stability-based assessment of feature sets: "For each dataset we generate the bootstrap stability for every k within range". Here it should be mentioned that this is carried out by clusterboot, and the full arguments should be given for documentation "The genes that comprise the dataset with the highest stability are the ones that compose the most appropriate set for the downstream analysis" - is this the single highest or a gene list in the top n datasets? Please specify. Choosing k number of clusters: "This approach prevents any bias from specific metrics and frees the user from making decisions on any specific metric and assumptions on the optimal number of clusters.". Out of consistency with the cluster reduction method in the "intra-clustering agreement" section which I believe is a novelty introduced by Omada, and within the context of automated analysis, the package should also ideally have an optimized number of k-clusters. K-means clustering analysis is often hindered due to the output often resulting in redundant, practically identical clusters which often requires manual merging. While I do understand the rationale described there and in Table 3, in terms of biological information and especially for deregulated genes analysis (e.g. row z-score clustering), should maximum k also not be determined by the number of conditions, i.e 2n, e.g. when n=2, kmax=4; n=3, kmax=8? Test datasets and Fig 6: Please expand on how the number of features 300 was determined. While this number of genes corresponds to a high stability index, is this number fixed or can it be dynamically estimated from a selection (e.g. from 100 to 1000)? Results Overall this section is well written and informative. I would just add the following if applicable: Figure 3: I think this figure could additionally include benchmarking, ROC curves of. Omada vs e.g. previous TCGA clustering analyses (PMID 31805048) Figure 4: I think it would be useful to compare Omada results to previous TCGA clustering analyses, e.g. PMID 35664309 Figure 6: swap C and D. Why is cluster 5 missing on D)?
6. GigaScience 31 Jul 2024
  
  in GigaScience
  
  Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High throughput RNA sequencing is providing large datasets capable of reflecting disease mechanisms. Clustering approaches have produced a number of tools to help dissect complex heterogeneous datasets, however, selecting the appropriate method and parameters to perform exploratory clustering analysis of transcriptomic data requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent. To address this we have developed Omada, a suite of tools aiming to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning based functions. The efficiency of each tool was tested with five datasets characterised by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements.
  
  This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: **Ka-Chun Wong ** (Original submission) The authors have proposed a tool to automate the unsupervised clustering of RNA-seq data. They have adopted multiple testing to ensure the robustness of the identified cell clusters. The identified cell clusters have been validated across different molecular dimensions with sound insights. Overall, the manuscript is well-written and suitable for GigaScience in 2023. I have the following suggestions: 1. It is very nice for the authors to have released the tool in BioConductor. I was wondering if the authors could also highlight it at the end of abstract, similar to the Oxford Bioinformatics style? It could attract citations. 2. The authors have spent significant efforts on validating the identified clusters from different perspectives. However, there are many similar toolkits. Comparisons to them in both time, userfriendliness, and memory requirement would be essential. 3. Since the submitting journal is GigaScience, running time analysis could be necessary to assess the toolkit's scalability performance in the context of big sequencing data. 4. Single-cell RNA-seq data use cases could also be considered in 2023.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.12.19.519427v2
www.biorxiv.org www.biorxiv.org

Multicellular, IVT-derived, unmodified human transcriptome for nanopore-direct RNA analysis

2
1. GigaScience 03 Jul 2024
  
  in GigaByte
  
  Editors Assessment:
  
  Oxford nanopore direct RNA sequencing (DRS) is a relatively new sequencing technology enabling measurements of RNA modifications. In vitro transcription (IVT)-based negative controls (i.e. modification-free transcripts) are a practical and targeted control for this direct sequencing, providing a baseline measurement for canonical nucleotides within a matched and biologically-derived sequence context. This work presents exactly this type of a long-read, multicellular, poly-A RNA-based, IVT-derived, unmodified transcriptome dataset. Review flagging more statistical analyses needed be performed for the data quality, and this was provided. The resulting data providing a resource to the direct RNA analysis community, helping reduce the need for expensive IVT library preparation and sequencing for human samples. And also serving as a framework for RNA modification analysis in other organisms.
  
  This evaluation refers to version 1 and 2 of the preprint
  
  Summary
2. GigaScience 03 Jul 2024
  
  in GigaByte
  
  ABSTRACTNanopore direct RNA sequencing (DRS) enables measurements of RNA modifications. Modification-free transcripts are a practical and targeted control for DRS, providing a baseline measurement for canonical nucleotides within a matched and biologically derived sequence context. However, these controls can be challenging to generate and carry nanopore-specific nuances that can impact analysis. We produced DRS datasets using modification-free transcripts from in vitro transcription (IVT) of cDNA from six immortalized human cell lines. We characterized variation across cell lines and demonstrated how these may be interpreted. These data will serve as a versatile control and resource to the community for RNA modification analysis of human transcripts.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.129), and has published the reviews under the same license. These reviews are as follows:
  
  Reviewer 1. Joshua Burdick
  
  Is the language of sufficient quality?
  
  Yes. In line 284, "bioinformatic" may be more often used than "BioInformatic", but the meaning is clear.
  
  Are the data and metadata consistent with relevant minimum information or reporting standards?
  
  Yes. Presumably the files (e.g. eventalign data) which are not in SRA will need to be uploaded to the GigaByte site.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  Yes. Line 177 should presumably be "nanopolish evenetalign".
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  Yes. In my opinion, Figure 3(A) nicely illustrates the uncertainty in current nanopore data, which is useful.
  
  Additional Comments:
  
  The RNA samples, and nanopore sequencing data, should be useful as a negative control. Sequencing these IVT RNA samples using the newer ONT RNA004 pore and kit might also be useful.
  
  Reviewer 2. Jiaxu Wang
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  No. The authors ran DSR for the in vitro transcribed transcriptional RNAs from 6 cell lines to remove the possible natural modifications. The data can be used as a control RNA pool for natural or artificial modification studies. however, more statistical analyses should be performed for the data quality. see comments below: (1) For more possible usage of this data, some QC analysis is better to be provided to confirm the quality of these sequencing data. For example: 1) What is the correlation between in vitro transcribed transcriptional RNAs and original DSR for each cell line? 2) how many genes have been captured in each cell line? (2) In Figure 2B, the author provides 3 conditions for ‘exclude’ and ‘include’, some statistical analysis should be performed to confirm how many cases in condition 1, condition 2, and condition 3. How many mismatches are showing in only 1 cell line, some cell lines or all the cell lines? The shared correct genes may be more confident references for the modification analysis. (3) Different reads of the same gene could have different mismatches in the IVT RNAs due to RT-PCR bias or other reasons (especially for the lower expressed RNAs), for example, there are 100 reads in total, 90 reads are the correct nucleotide at a given position, 10 reads have a mismatch in the IVT sample, then how to define the signal as the control reference? Given that the nature modification is low in RNA, some threshold should be applied for the confident result, for example, what is the lowest expression threshold that could be used as a confident control reference?
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data?
  
  No. For more possible usage of this data, more QC data should be performed, please refer to my above comments.
  
  Re-review: I am happy to see the changes. Thanks!
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.04.06.535889v2
www.biorxiv.org www.biorxiv.org

PhysiCell Studio: a graphical tool to make agent-based modeling more accessible

2
1. GigaScience 01 Jul 2024
  
  in GigaByte
  
  Editors Assessment:
  
  This paper presents a new tool to make using PhysiCell easier, which is an open-source, physics-based multicellular simulation framework with a very wide user base. PhysiCell Studio is a graphical tool that makes it easier to build, run, and visualize PhysiCell models. Over time, it has evolved from being a GUI to include many additional functionalities, and can be used as desktop and cloud versions. This paper outlines the many features and functions, the design and development process behind it, and deployment instructions. Peer review improved the organisation of the various repositories and adding both a requirements.txt and environment.yml files. Looking to the future the developers are planning to add new features based on community feedback and contributions, and this paper presents the many code repositories if readers wish to contribute to the development process.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 01 Jul 2024
  
  in GigaByte
  
  AbstractDefining a multicellular model can be challenging. There may be hundreds of parameters that specify the attributes and behaviors of objects. Hopefully the model will be defined using some format specification, e.g., a markup language, that will provide easy model sharing (and a minimal step toward reproducibility). PhysiCell is an open source, physics-based multicellular simulation framework with an active and growing user community. It uses XML to define a model and, traditionally, users needed to manually edit the XML to modify the model. PhysiCell Studio is a tool to make this task easier. It provides a graphical user interface that allows editing the XML model definition, including the creation and deletion of fundamental objects, e.g., cell types and substrates in the microenvironment. It also lets users build their model by defining initial conditions and biological rules, run simulations, and view results interactively. PhysiCell Studio has evolved over multiple workshops and academic courses in recent years which has led to many improvements. Its design and development has benefited from an active undergraduate and graduate research program. Like PhysiCell, the Studio is open source software and contributions from the community are encouraged.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.128), and has published the reviews under the same license. This is part of the PhysiCell Ecosystem Series: https://doi.org/10.46471/GIGABYTE_SERIES_0003
  
  Reviewer 1. Meghna Verma:
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?
  
  The authors have provided links for video descriptions for installation and that is appreciated.
  
  One overall recommendation is: If all the screenshots (for e.g.: from Fig 1-12 of the main paper and all the subsections in Supplementary) can be combined in one figure that will help enhance the complete overview and the overall flow of the paper.
  
  Additional comments are available here: https://gigabyte-review.rivervalleytechnologies.comdownload-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvVFIvNTA3L1Jldmlld19QaHlzaUNlbGxTdHVkaW9fTVYucGRm
  
  Reviewer 2. Koert Schreurs and Lin Wouters supervised by Inge Wortel
  
  Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?
  
  The problem statement is addressed in the introduction, which mentions the need for a GUI tool as a much more accessible way to edit the XML-based model syntax. However, it is somewhat confusing who exactly the intended audience of the paper is. Is the paper targeted at researchers that already use PhysiCell, but might want to switch to the GUI version? Or should it (also) target the potential new user-base of researchers interested in using ABMs, for whom the XML version was not sufficiently accessible and who will now gain access to these models because there is a GUI? Specifying the intended audience might impact some sections of the paper. For example, for users who already use PhysiCell, the step-by-step tutorials might not be useful since they would already know most of the available options; they would just need a quick overview of what info is in which tab. But if the paper is (also) targeted at potential new users, then some additional information could make both the paper and the tool much more accessible, such as:
  
  A clear comparison to other modeling frameworks and their functionalities. Why should they use PhysiCell instead of one of the other available (GUI) tools? For example, the referenced Morpheus, CC3D and Artistoo all focus on a different model framework (CPMs); this might be worth mentioning. And what about Chaste? Does it represent different types of models, or are there other reasons to consider PhysiCell over Chaste or vice versa? For new users, this would be important information to include. The paper currently also does not mention other frameworks except those that offer a GUI. While the main point of the paper is the addition of the GUI, for completeness sake it might still be good to mention a broader overview of ABM frameworks and how they compare to PhysiCell, or simply to refer to an existing paper that provides such an overview.
  
  The current tutorial immediately dives into very specific instructions (what to click and exact values to enter), often without explaining what these options mean or do. New users would probably appreciate to get a rough outline of which types of processes can be modelled, and which steps they would take to do so. This could be as easy as summarising the different main tabs before going into the details. I understand that some of these explanations will overlap with the main PhysiCell software – but considering that the GUI will open up modelling to a different type of community, it might make sense to outline them here to get a self-contained overview of functionality.
  
  Indeed, if the above information is provided, the detailed tutorial might fit better as an appendix or in online documentation. That would also leave more space to explain not only which values to enter, but also what these variables do, why choose these values, what other options to consider, etc. Having this information together in one place would be very useful for beginning users.
  
  Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code?
  
  The software is available under the GPL v3 licence.
  
  As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?
  
  There is a Github repository, ensuring that it is possible to contribute and report issues, and the paper explicitly invites community contributions. However, although the paper mentions that it is possible to seek support through Github Issues and “Slack channels”, we could find no link to the latter resource. This should probably be added to make this resource usable for the reader (or otherwise the statement should be removed)
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?
  
  Mostly yes, as installation and deployment are outlined in the paper and documentation. However, we did notice a couple of issues: - The studio guide explains how to compile a project in PhysiCell (https://github.com/PhysiCell-Tools/Studio-Guide/blob/main/README.md), but does not mention that Mac users need to specify the g++ version at the top of the Makefile. This is explained in a separate blog (http://www.mathcancer.org/blog/setting-up-gcc-openmp-on-osx-homebrew-edition/) but should be outlined (or at least referenced) here as well. - There are several different resources covering the installation process, referring to e.g. github.com/physicell-training, github.com/PhysiCell-Tools/Studio-Guide, and the abovementioned blog. But this might not be very accessible to all users targeted by the new GUI functionality (especially when command line interventions and manual Makefile edits are involved). While not all of this has to be changed before publication, having all information in one place would already improve accessibility to a larger user-base. - When following the instructions (https://github.com/PhysiCell-Tools/Studio-Guide/blob/main/README.md), “python studio/bin/studio.py -p -e virus-sample” the -p flag gives an error: “Invalid argument(s): [‘-p’]”. We assumed it has to be left out, but perhaps the docs have to be updated.
  
  Is the documentation provided clear and user friendly?
  
  Mostly yes, as there is already a lot of documentation available. However, the user-friendliness could be improved with some minor changes. For example, the documentation could be made more user-friendly if resources were available from a central spot. Currently, information can be found in different places: - https://github.com/PhysiCell-Tools/Studio-Guide/blob/main/README.md provides installation instructions and a nice overview of what is where in the GUI, but as mentioned above, does not mention potential issues when installing on MacOS. - The paper provides very detailed examples; these might be nice to include along with the abovementioned overview. - Potentially other places as well. It would be great if the main documentation page could at least link to these other resources with a brief description of what the user will find there. Further, some additions would make the documentation more complete: - It would be good to have an overview somewhere of all the configuration files that can be supplied/loaded (e.g. those for “rules” and for initial configurations). - A clearer instruction/small tutorial on how to use simularium and paraview with physicell studio; especially for paraview there is no instruction on how to use your own data or make your own `.pvsm` file In the longer term, it might be worthwhile to set up a self-contained documentation website (this is relatively easy nowadays using e.g. Github pages), which can outline dependencies, installation instructions, a quick overview, detailed tutorials, example models, links to Github issues/slack communities. This is not a requirement for publication but might be worth looking into in the future as it would be more user-friendly.
  
  Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?
  
  No. The core functionality of the software is nicely outlined in the Github README (https://github.com/PhysiCell-Tools/Studio-Guide/blob/main/README.md), but as mentioned before, this high-level overview is missing in the paper itself. The README and paper recommend installing the Anaconda python distribution to get the required python dependencies. This is fine, but adding a setup file or requirements.txt might still be useful for users who are more familiar with python and want a more minimal installation. Providing a conda environment.yml that allows running the studio along with paraview and/or simularium might also be helpful. Note that running the studio with simularium in anaconda did not work because anaconda did not have the required vtk v9.3.0; instead we had to install simularium without anaconda (“pip3 install simularium”).
  
  Are there (ideally real world) examples demonstrating use of the software?
  
  The detail tutorial nicely walks the reader through the tool (although as mentioned before, a high-level overview is missing and the level of detail feels slightly out of place in the paper itself). When walking through the example in the paper and the supplementary, we did run into a few (minor) issues: - It might be good to stress explicitly that after copying the template.xml into tumor_demo.xml, the first step is always to compile using “make”. The paper mentions “Assuming … you have compiled the template project executable (called “project”) …”. But it might not be immediately clear to all users how exactly they should do so (presumably by running “make tumor_demo” after copying the xml file?). - When running “python studio/bin/studio.py -c tumor_demo.xml -e project” as instructed, a warning pops up that “rules0.csv” is not valid (although the tool itself still works). - The instructions for plotting say to press “enter” when changing cmin and cmax, but Mac offers only a return key. Pressing fn+return to get the enter functionality also does not work; it might be good to offer an alternative for Mac. - When reproducing the supplementary tutorial, results were slightly different. It might be good if the example would offer a random seed so that users can verify that they can reproduce these results exactly. In our hands, when reproducing figs 39, 40, 48, 49 yields way more (red) macrophages (even when running multiple times), but we could not be sure if this is due to variation between runs, or a mistake in the settings somewhere. The paper mentions that they have started setting up automated testing, but it does not give an idea of what the current test coverage is. Did they add a few tests here and there, or start to systematically test all parts of the software? I understand the latter might not be achievable immediately, but it would be good if users and/or contributors can at least get a sense of how good the current coverage is. (Note: the framework uses pytest, which seems to offer some functionality to generate coverage reports, see e.g. https://www.lambdatest.com/blog/pytest-code-coverage-report/). The code in studio_for_pytest.py has a comment “do later, otherwise problems sometimes”, but it is not entirely clear if the relevant issue has been resolved.
  
  Additional Comments: The presented tool offers a GUI interface to the PhysiCell framework for agent-based modeling. As outlined for the paper, this offers significant value to the users since editing a model is now much more accessible. The tool comes with extensive functionality and instructions. Overall, the tool functions as advertised, and will be of great value to the community of PhysiCell users that now have to edit XML files by hand. It is therefore (mostly) publishable as is if some of the issues with installation (mentioned above) can be straightened out. That said, we do think some improvements could make both the tool and the paper more accessible to a larger user audience. Most of these have been mentioned in the other questions, but we will list some additional ones below. Note that many of these are just suggestions, so we will leave it up to the authors if and when they implement them.
  
  Suggestions for the paper: While the paper nicely outlines design ideas and usage of the tool, there were some points where we felt that the main point did not quite come across, for example: - As mentioned in the question about problem statement and intended audience, adding some information to the paper would make it a more useful resource to users not yet familiar with PhysiCell (see remarks there). - The section “Design and development” describes the development history of the tool. In principle this is a valuable addition, because it illustrates how the project is under ongoing development and has already been improved several times based on feedback of users. However, the amount of information on each previous stage is slightly confusing; it is not entirely clear how this relates to the paper and current tool. If the main point is to showcase that the current tool has been built based on practical user experiences, this would probably come across better if this section was somewhat shorter and focused on the design choices rather than previous versions. If the main point is something else, it should be clarified what the main idea is. – The point of Table 1 was unclear to us – consider removing or explaining the main idea. - Several figures do not have captions (e.g. Figure 1 but also others); it would be helpful to clarify what message the figure should convey. – P4 “adjust the syntax for Windows if necessary” – is it self-explanatory how users should adjust? Consider adding the correct code for windows as well if possible, since users that want to use the GUI tool might not be familiar with command line syntax. - P6 “if you create your own custom C++ code referring directly to cell type ID” – this functionality is never discussed. This might be part of the general PhysiCell functionality, but it would be good to at least provide a link to a resource on how you could do this. - P8 “Only those parameters that display … editing the C++ code” – it was not entirely clear to me what this means, could you clarify? - P13 mentions you can immediately see changes to the model parameters made. This is very useful for prototyping when users want immediate feedback. However, what happens when you try to save output for a simulation where parameters were changed while the simulation was running? Would users be reminded that their current output is not representative? - Discussion: it is good to mention that the tool is already being used. Can you give an indication based on your experience how long it takes new users to learn to navigate the tool? This might be useful information to add in the paper. - The last statement on LLMs seems to come out of nowhere. Consider leaving it out or expanding further on what would be needed to make this work/how feasible this is.
  
  Further comments on the tool itelf: - The paper mentions that results may not be fully reproducible if multiple threads are used (I assume this is the case even when a random seed is set). In this case, would it make sense to throw a warning the first time a user tries to set a seed with multiple threads, to avoid confusion as to why the results are not reproducible? - Unusable fields are not always greyed out to indicate that they are disabled, which sometimes makes it seem as though the tool is unresponsive. In other places unusable options are set to grey, so it might be good to double-check if this is consistent. - At the initial conditions (IC) page there is no legend; it might be good to add one. - There are some small inconsistencies between the field names mentioned in the paper and those in the tool/screenshots. For example “boundary condition” (p5) should be “dirichlet BC”, “uptake” (p6) should be “uptake rate”. For the latter, the paper mentions that the length scale is 100 micron but this should be visible in the tool as well. - Not all fields have labels, so it is not always clear what the options do (see e.g. drop-downs in Figure 6). – There are a few points in the tool where you have to “enable” a functionality before it works, but this might not always be intuitive. For example, if you upload a file with initial conditions, it can be assumed that you want to use it. There might be good reasons for this in some cases but in general, consider if all these checkpoints are necessary or if this could be simplified. Same goes for the csv files that have to be saved separately instead of through the main “save” button – in the long term it might be worth saving all relevant files when they are updated, or at least throwing a warning that you have to save some of them separately.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.10.24.563727v2
www.biorxiv.org www.biorxiv.org

Low-coverage whole genome sequencing for a highly selective cohort of severe COVID-19 patients

2
1. GigaScience 01 Jul 2024
  
  in GigaByte
  
  Editors Assessment:
  
  Many studies have explored the genetic determinants of COVID-19 severity, these GWAS studies using microarrays or expensive whole-genome sequencing (WGS). Low-coverage WGS data can be imputed using reference panels to enhance resolution and statistical power while maintaining much lower costs, but imputation accuracy is difficult to balance. This work demonstrates how to address these challenges utilising the GLIMPSE1 algorithm, a less resource-intensive tool that produces more accurate imputed data than its predecessors. Generating a dataset containing 79 imputed low-coverage WGS samples from patients with severe COVID-19 symptoms during the initial wave of the SARS-CoV-2 pandemic in Spain. The validation of this imputation and filtering process shows that GLIMPSE1 can be confidently used to impute variants with minor allele frequency up to approximately 2%. After peer review the authors clarified and provided more validation and statistics and figures to help convince this approach was valid. This work showcasing the viability of using low-coverage WGS imputation to generate data for the study of disease-related genetic markers, alongside a validation methodology to ensure the accuracy of the data produced. Helping inspire confidence and encouraging others to deploy similar approaches to other infectious diseases, genetic disorders, or population-based genetic studies. Particularly in large-scale genomic projects and resource-limited settings where sequencing at higher coverage could prove to be prohibitively expensive.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 01 Jul 2024
  
  in GigaByte
  
  AbstractDespite advances in identifying genetic markers associated to severe COVID-19, the full genetic characterisation of the disease remains elusive. This study explores the use of imputation in low-coverage whole genome sequencing for a severe COVID-19 patient cohort. We generated a dataset of 79 imputed variant call format files using the GLIMPSE1 tool, each containing an average of 9.5 million single nucleotide variants. Validation revealed a high imputation accuracy (squared Pearson correlation ≈0.97) across sequencing platforms, showing GLIMPSE1’s ability to confidently impute variants with minor allele frequencies as low as 2% in Spanish ancestry individuals. We conducted a comprehensive analysis of the patient cohort, examining hospitalisation and intensive care utilisation, sex and age-based differences, and clinical phenotypes using a standardised set of medical terms developed to characterise severe COVID-19 symptoms. The methods and findings presented here may be leveraged in future genomic projects, providing vital insights for health challenges like COVID-19.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.127 ), and has published the reviews under the same license. For a video summary from the author see: https://youtu.be/x6oVzt_H_Pk?si=Byufhl0mIL3h0K6u
  
  The reviews are as follows:
  
  Reviewer 1. Jong Bhak:
  
  Severe cases of covid-19 patients are critical data. This manuscript deals with detailed clinical information genome set as a subset of exome sequences and provide invaluable data for on-going global covid-19 omics studies.
  
  Reviewer 2. Alfredo Iacoangeli:
  
  The authors present the release of a new dataset that include low coverage WGS data of 79 individuals who experienced severe covid-19 in Madrid (Spain). The authors processed the data and imputed common variants and they are making this dataset available to the scientific community. They also present the clinical data of these patients in a descriptive and informative fashion. Finally, the authors also validated the quantify of their imputation, showcasing the potential of low coverage WGS as an alternative to microarrays. Overall the manuscript is written very well, clear, and exhaustive. The data is certainly valuable. Its generation and processing and analysis appears robust.
  
  Overall I support the publication of this article and dataset. I only have a small number of minor suggestions for the authors: The sentence "Traditionally, the genotyping process has relied on array technologies as the standard, both at the broader GWAS level and the more specific genetic scoring and genetic diagnostics levels" sounds a little off. I totally understand where the authors come from but given the central role of NGS and Sanger for genetic diagnostics I would suggest the authors to modify accordingly or to keep the GWAS focus.
  
  Please double-check the use a statistical terms in the description of the imputed data. For example: "On average, each VCF file in this rich dataset contains 9.49 million high-confidence single nucleotide variants [95%CI: 9.37 million - 9.61 million] (Figure 1)." The use of CI in this context is a little miss-leading as it is not strictly referring to a distribution of probability but to a finite collection. A range would be more appropriate. The authors say that they examined the ethnicity of the 79 individuals, however I do not think the ancestry is actually reported anywhere while a few figures show ancestral population data. The authors might clarify or correct the terminology.
  
  Looking at figure 2 the sentence " although the male age distribution exhibits a broader range and higher variability, suggestive of a greater" does not appear justified. The authors might want to clarify or correct accordingly.
  
  The sentence "This exploratory analysis highlights the diverse ways in which severe COVID-19 can present, and the importance of comprehensive and nuanced clinical phenotyping in improving our understanding and management of the disease." suggests some basic clustering might be useful. The readers might benefit from a couple of graphs or figures quantifying the overlap of the SNPs across samples and maybe one that shows the density of SNPs across the genome.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.01.28.577610v1
www.biorxiv.org www.biorxiv.org

Genome evolution and transcriptome plasticity associated with adaptation to monocot and eudicot plants in Colletotrichum fungi

2
1. GigaScience 01 Jul 2024
  
  in GigaScience
  
  this pathogen, coinciding with a progressive shrinking of the degradative arsenal and expansions in lineage specific genes. Comparative transcriptomics of four reference species with different evolutionary histories and adapted to different hosts revealed similarity in gene content but differences in the modulation of their transcription profiles. Only a few orthologs show similar expression profiles on different plant cell walls. Combining genome sequences and expression profiles we identified a set of core genes, such as specific transcription factors, involved in plant cell wall degradation in Colletotrichum.Together, these results indicate that the ancestral Colletotrichum were associated with eudicot plants and certain branches progressively adapted to different monocot hosts, reshaping part of the degradative and transcriptional arsenal.
  
  Reviewer 2: Nicolas Lapalu This manuscript describes the adaptation of the Colletotrichum genus to monocotyledonous and dicotyledonous plants with regard to the content and expression of genes from 30 genomes, with a subsampling of 4 genomes for transcriptomic analyses. Major remarks: "Considering that the analyses carried out are affected by the sampling, as closely related species are likely to have more shared genes compared to species that are more distant from others," Yes, Indeed, it's clearly a possible bias due to the sampling, as you write. As you considered all genomes together to define specific genes, monocot specific species have few specific genes due to their phylogenetic proximity. Based on this, could you address these observations based on combination of figure 1 and 2: 1. The number of specific genes in C.eremochloae (1608) vs in C.sublineola (1643), while divergence time between both seems short and similar to the group of C.lupini, C.costaricense (monoct) … with approximatively 100 genes specific to each species. How could such closely related genomes have acquired so many specific genes in such a short time compared with other species during the same period of evolution? 2. Same remark for in C.phormii (911) vs C.salicis (286), when it's even more disturbing with the switch to dicot and a loss of many genes for C.salicis. For both cases mentionned above, a detailed comparison between the two genomes could be useful to obtain some explanations of the events and genes involved. Moreover, interpretation of the phylogenetic tree (Figure 1), could be lead to propose three clusters of genomes, based on evolution time and plant host: Monocot, Dicot "old" (C.orbiculare, C.noveboracense, …) and Dicot "young" (C.melonis, C.cuscutae, …). Did the authors attempt an analysis with a such view of the data? Maybe that will complete the view of C.acutatum complex (46 genes) vs C.graminciola complex (28 genes) form which C.orchidophylum and C.phormis are excluded. Finally, one of the most interesting thing is the proximity of C. phormii and C.salicis in the same clade but with a recent host specialization. Despite the poor quality of the genome of C.salicis vs C.phormii, an comparative genomic approach with a tool like Synchro could provide clues as to gene losses and their location (all along the genome/ specific regions ). Figure 3: Please explain further Figure 3 A, described as a PCA. No axis (dimension) has been shown with a % explaining the divergence between organisms. This is confusing and does not allow me to know whether the gene sets used to compare the 4 genomes are only shared genes or all genes. The rest of the figure is much clearer and the comments are clear on the response to species specificity (under/over expression of genes) for each genome. Figure 4: "the expression of the orthologous genes was clustered for the four fungal species (Figure 4A)" As written, it is assumed that you used ortholog genes established between the 4 species, this does not appear to be the case with so many genes missing in C.graminicola in figure 4. To continue on this point, I have not found the minimum number of species found in a cluster to set a cluster of orthologs (maybe written but not found). What is the threshold for divergence or sequence similarity? Have you considered sequence length (query coverage vs subject coverage) to allow clustering of potentially split/fragmented genes in annotations? Minor remarks: The authors limit their analysis to 30 genomes, whereas more than 270 genomes of Colletotrichum are available, from over 70 species. Research time is clearly longer than the time to generate genomic resources, but it could be interesting to list a few new genomes missing from those analysed and that could have significant added value (particularly if sequenced in long reads, providing complete genomes). Transcriptomic analyses were carried out on 4 genomes. The choice of the genomes was not discussed, and maybe done by convenience with strains available at the lab. In fact, C.higginsianum is well sequenced, assembled and studied and chosen as one of the specific hosts of dicotyledons, whereas it is a member of the C.destructivum complex. Similarly, C.phormii appears to be a recent species with an adaptation to monocots. L 113 : "species with bigger genomes are characterized by a lower GC content", please rewrite the link between genome size and GC content. Between species of same genus genome size is most often linked to the invasion of TE element (RIPed or not in fungi). Strongly ripped genomes (Leptosphaeria, Venturia) are not always large compared to the size of other species. Data availability: All genomes were released in public Databases. I do not find accession numbers for RNA-Seq runs. Many supplementary details have been provided. I appreciate the BUSCO logs for checking the completeness of gene sets, which provide me some clues about the quality of genome annotation, that was never discussed or pointed out in the manuscript as possible source of bias. Overall, the manuscript is very interesting and confirms the results previously identified in terms of specificities of CAZy families associated with host plant adaptation in the Colletotrichum genus. The authors demonstrate a great knowledge of the CAZome and associated biological processes, which provides a great deal of valuable information for the community working on Colletotrichum and more generally for all those working on such enzymes. Finally, the transcriptomic data suggest that species specificity and host adaptation are more related to an expression pattern than to specific gene content, than a specific gene content.
2. GigaScience 01 Jul 2024
  
  in GigaScience
  
  Colletotrichum fungi infect a wide diversity of monocot and eudicot hosts, causing plant diseases on almost all economically important crops worldwide. In addition to its economic impact, Colletotrichum is a suitable model for the study of gene family evolution on a fine scale to uncover events in the genome that are associated with the evolution of biological characters important for host interactions. Here we present the genome sequences of 30 Colletotrichum species, 18 of them newly sequenced, covering the taxonomic diversity within the genus. A time-calibrated tree revealed that the Colletotrichum ancestor diverged in the late Cretaceous around 70 million years ago (mya) in parallel with the diversification of flowering plants. We
  
  Reviewer 1: Jamie McGowan In this study, Baroncelli and colleagues carry out a comprehensive analysis of genomic evolution in Colletotrichum fungi, an important group of plant pathogens with diverse and economically significant hosts. Their comparative genomic and phylogenomics analyses are based on the genome sequences of 30 Colletotrichum species spanning the diversity of the genus, including pathogens of dicots, monocots, and both dicots and monocots. This includes 18 genome sequences that are newly reported in this study. They also perform comparative transcriptomic analyses of 4 Colletotrichum species (2 dicot pathogens and 2 monocot pathogens) on different carbon sources. Overall, I thought the manuscript was very well written and technically sound. The results should be of interest to a broad audience, particularly to those interested in fungal evolutionary genomics and plant pathology. I only have a few minor comments. Minor comments: (1) Lines 50 - 51: "The plant cell wall (PCW) consists of many different polysaccharides that are attached not only to each other through a variety of linkages providing the main strength and structure for the PCW". I found this confusing - is the sentence incomplete? (2) Line 66: "Some Colletotrichum species show…" I think there should be a couple of introductory sentences about Colletotrichum before this. (3) Figure 1: It would be informative to label which genomes were sequenced with PacBio versus just Illumina. (4) Lines 254 - 255: "As no other enrichment was identified we performed a manual annotation of genes identified in Figure 3D". I don't think it is clear here what manual annotation this is referring to. (5) One area where I felt the analysis was lacking was the lack of analyses on genome repeat content. The authors highlight the large variation in genome sizes within Colletotrichum species (~44 Mb vs ~90 Mb) and show in Figure 1 that this correlates with increased non-coding DNA. It would have been interesting to determine if this is driven by the proliferation of particular repeat families. (6) Another concern is the inconsistent use of genome annotation methods. 12 of the genomes reported in this study were annotated using the JGI annotation pipeline, whereas the other 6 were annotated using the MAKER pipeline. Several studies (e.g., Weisman et al., 2022 - Current Biology) show that inconsistent genome annotation methods can inflate the number of observed lineage specific genes. The authors may wish to comment on this or demonstrate that this isn't an issue in their study (e.g., by aligning lineage specific proteins against the other genome assemblies).
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.09.22.508453v1
www.biorxiv.org www.biorxiv.org

Exploring polymorphic interspecies structural variants in Eucalyptus: Unravelling Their Role in Reproductive Isolation and Adaptive Divergence

2
1. GigaScience 01 Jul 2024
  
  in GigaScience
  
  respectively. Focusing on inversions and translocations, symmetric SVs which are readily genotyped within both populations, 24 were found to be structural divergences, 2,623 structural polymorphisms, and 928 shared structural polymorphisms. We assessed the functional significance of fixed interspecies SVs by examining differences in estimated recombination rates and genetic differentiation between species, revealing a complex history of natural selection. Shared structural polymorphisms displayed enrichment of potentially adaptive genes.
  
  Reviewer 2: Lejun Ouyang Structural variation plays an important role in the domestication and adaptability of species. The author compared the structural variation between E. melliodora and E. sideroxylon populations. This is a very interesting study, but it feels that the author is just statistical data. However, the biological problems caused by these differences have not been condensed, such as the impact of structural variation on recombination. What effect does it have on the differentiation of the two populations? Is it promoting or inhibiting? Secondly, the author's writing is not very clear, and some of the results are described too simply, resulting in unclear conclusions. When formatting pictures, try to avoid nesting pictures, and use A, B, C, etc. to represent them. However, some obvious issues, but not limited, are listed above. Here are other minor issues: 1. Lines 62-64: References are required. 2. Lines 145-150: It is recommended to put it in the materials and methods section. 3. The Synteny and structural variation annotation section requires a detailed explanation of the results in Figure 2 and Table 2. 4. It is recommended to make Table 2 into a picture, the effect will be better. 5. The form should be a three-line grid. 6. Why does the recombination rate in Table 3 have positive and negative errors at the genome level, but only negative errors at the chromosome average level? 7. 219-220 It is recommended that methods not appear in the results section. It is recommended to put it in the methods section. 8. The Structural variation genotyping in the results section needs to be modified. 9. Figure 6 is a bit confusing. It is recommended to revise it to make it clearer. 10. The results section of Figure 7 is not clearly described and the notes are not clear. What do the different colors represent? 11. Lines 263-264: It is recommended that methods should not appear in the results section, but can be placed in the materials and methods section. 12. It is recommended that Figure 8 be divided into Figure 8A and Figure 8B. Try not to have pictures within pictures, which can easily lead to unclear references. 13. Lines 276-281: It is recommended to put it in the method section. 14. Lines 289-290: It is recommended to put it in the method section. 15. Lines 307-308: E. melliodora and E. sideroxylon italics 16. Lines 311-318, lines 320-321: It is recommended to put them in the method section. 17. Lines 338-339: E. melliodora and E. sideroxylon italics. 18. Line 342: It is recommended to put it in the discussion. 19. It is recommended to change Figure 9B, Figure 10B and Figure 11B to Figure 20. Line 561: Add references.
2. GigaScience 01 Jul 2024
  
  in GigaScience
  
  Structural variants (SVs) play a significant role in speciation and adaptation in many species, yet few studies have explored the prevalence and impact of different categories of SVs. We conducted a comparative analysis of long-read assembled reference genomes of closely related Eucalyptus species to identify candidate SVs potentially influencing speciation and adaptation. Interspecies SVs can be either fixed differences, or polymorphic in one or both species. To describe SV patterns, we employed short-read whole-genome sequencing on over 600 individuals of E. melliodora and E. sideroxylon, along with recent high quality genome assemblies. We aligned reads and genotyped interspecies SVs predicted between species reference genomes. Our results revealed that 49,756 of 58,025 and 39,536 of 47,064 interspecies SVs could be typed with short reads, in E. melliodora and E. sideroxylon
  
  Reviewer 1: Jakob Butler Ferguson et al have performed a thorough analysis of two species of Eucalyptus, quantifying the extent of structural variation between assembled genomes of the species and determining how prevalent those variations are across a selection of wild material. I believe this study is of sufficient quality for publication in GigaScience, if some minor inconsistencies and grammatical issues are addressed, and a few supporting analyses are performed. The major changes I would like to see include the addition of a syri plot of the complete set of SVs between E. melliodora and E. sideroxylon. I believe this, along with correcting the scale on the plots of recombination in Figure S6/S7 would allow for a better comparison of how recombination rate is interacting with the SVs. I would also suggest a more formal test of enrichment for COG terms, to better support the statements of "enrichment" in the discussion. Suggested changes by line: Line 142 - This section is quite short, I would either merge this section into the Genome scaffolding (and annotation) section, or expand on the results of the gene annotation. Line 182 - (Supplementary Figure S4) Line 183 (and throughout) - Please be consistent with your references to tables and figures. Line 186 - delete comma after 28.63% Line 194 - These are density plots rather than histograms Figure 4 - Both axes are labelled as PC1 Line 217 (page 10, line numbers are doubled up) - This seems repetitive, perhaps "…especially as they may also represent divergent sequences". Line 221 (page 11) - Please insert "and" before polymorphic translocations Line 223 - You have stated that those not successfully genotyped in both species are private or artefacts earlier in the paragraph, please reduce the repetition. Figure 6 - I don't find this figure particularly informative (and somewhat confusing to interpret). I think showing the percentages of each different SV in a visual form implies a level of equivalence in genomic impact, which is difficult to reconcile with the raw difference in numbers. I think a supplemental table with the focus on the percentages would illustrate the point better. Line 246 - There is no mention in the methods about what r threshold was used to declare a pair "correlated", please state it here or in the methods. Line 265 - This line was confusing to interpret. A suggested alteration: "significant value. After attempting to functionally annotating all genes across the genome and placing them within COG categories, 247 of the total 281 gene candidates in SSPs were annotated. These genes were enriched for...." Line 266 - I would like to see a formal enrichment analysis rather than "increased/decreased association", so we could have a clearer picture of which gene functions are truly over/underrepresented in SSPs. You could subsequently limit Figure 8 to those that show a difference. Line 275 - The grammar of this title is a bit off, perhaps "Effect of syntenic, rearranged, unaligned regions and genes on recombination rates" Line 276 - This is the first mention of p, please define it as recombination rate Line 283 - The supplemental Figure S6 and S7 seem to have regions of heightened recombination, but this is difficult to interpret and compare with the current variable axis scales. Please make these consistent. I would also like to see the syri graph of the two aligned genomes, as this would allow for a visual comparison of SV regions with recombination rate. Line 290 - How were p-values adjusted? Line 294 - More information about this 'significantly' higher recombination rate would be good, either in the figure or further expanded in the text. Line 307 - Italics for species names (repeated in Figure 10 and Figure 11 caption) Line 310 - Similar problem to line 275 Figure 10 - Having Figure 9b repeated in Figure 10 and Figure 11 is unnecessary. Line 336 - Vertical lines show average FST, not p Line 341 - Similar problem to line 275 Line 356 - translocations should be plural Line 367 - Vertical lines show average SNP density, not p Line 391 - This is the first mention of barrier loci, please define Line 413 - As mentioned above, I would recommend a formal enrichment test to support this statement Line 428 - Grammar is poor here, please correct Line 490 - Please make this a complete sentence Line 499 - Please state how the Hi-C map was manually edited, and what informed the position of those edits. Line 508 - Please provide an example of how well your LAI score of ~18 compares. The LAI paper seems to intimate that 10 is low quality? Line 513 - Missing bracket for version number Line 536 - Syntenic rather than synteny Line 717 - Formatting error in references Supp table S3-S4-S5 - Space between E. and sideroxylon
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.10.20.563207v1
www.biorxiv.org www.biorxiv.org

LRTK: A platform agnostic toolkit for linked-read analysis of both human genomes and metagenomes

3
1. GigaScience 01 Jul 2024
  
  in GigaScience
  
  Python-based workflow for processing linked-read sequencing datasets, thereby filling the current gap in the field caused by platform-centric genome-specific linked-read data analysis tools.
  
  Reviewer 3: Dmitrii Meleshko The paper titled "LRTK: A Platform-Agnostic Toolkit for Linked-Read Analysis of Both Human Genomes and Metagenomes" by Yang et al. is dedicated to the development of a unified interface for linked-read data processing.The problem described in the paper indeed exists; each linked-read technology requires complex preprocessing steps that are not straightforward or efficient. The idea of consolidating multiple tools in one place, with some of them modified to handle multiple data types, is commendable. Overall, I am supportive of this paper. My main concern, however, is that the impact of linked-read applications in the paper appears to be exaggerated, and the authors need to provide more context in their presentation. Also, some parts of the paper are vague described. I will elaborate on my concerns in more detail below.X) Linked-read sequencing generates reads with high base quality and extrapolative 64 information on long-range DNA connectedness, which has led to significant 65 advancements in human genome and metagenome research[1-3]. - Citations 1-3 do not really tell about advancements in human genome and metagenome research, these are technologies papers. Similar problem can be found in "Despite the limitations that genome specificity…" paragraph. Authors cited and described several algorithms, that are not really genomic studies. E.g. "stLFR[2] has found application in a customized pipeline that has been developed to first convert its raw reads into a 10x-compatible format, after which Long Ranger is applied for downstream analysis." is not an example of genomic study, but a pipeline description.X) Table S1 does not improve the paper, I would say it does completely the opposite. LongRanger is not a toolkit, it should be considered as read alignment tool that outputs some SVs and haplotypes along the way. So LongRanger vs LRTK comparison does not make sense to me. There are other tools that solve metagenome assembly problem, human assembly problem, call certain classes of SVs etc.x) I think incorporating longranger is important, since its performance is reported to be better than EMA for human samples and it is also more popular than EMA. Is it possible and have you tried doing it?x) I would remove exaggerations such as "myriad" from the text. The scope of linked-reads is pretty limited nowadays. I agree that linked-reads might be useful in metagenomics/transcriptomics and other scenarios that were mentionedin the text, but the number of studies is very limited especially nowadays, and was not really big when 10X platform was on the risex) "LRTK reconstructs long DNA fragments" - when people talk about long fragment reconstruction, they usually mean moleculo-style reconstruction through assembly. This reconstruction resemble "barcode deconvolution", described in Danko et al, and Mak et al. So I would stick to this terminologyx) it is important to note that, Aquila, LinkedSV and VALOR2 are linked-read specific tools, while FreeBayes, Samtools and GATK are short-read tools. Also, provide target SV length for both groups of tools.x) There are some minor problems with Github readme. E.g. "*parameters". Also, I don't understand how to use conversion in real life… E.g. 10X Genomics data often comes as a folder with multiple gzipped R1/R2/I1 files. I don't understand how would I use it in that case.x) Please cite or explain why this is happening (not only when) - "A known concern with stLFR linked-read sequencing is the loss of barcode specificity during analysis."x) I don't understand what is "Length-weighted average (Î¼FL) and unweighted average (WÎ¼FL) of DNA 688 fragment lengths." from the figure. One of them is just an average and what about second? Figure looks confusingx) LRTK supports reconstruction of long DNA fragments - this section describes something else. More about statistics and data QCx) LRTK promotes metagenome assembly using barcode specificity - please remove supernova, it was never a metagenomic assembler. Check cloudSPAdes insteadx) "The superior assembly performance we have observed" - superior compared to what? If so, some short-read benchmark should be included.x) "LRTK improves human genome variant phasing using long range information" - What dataset is this? What callset was used for ground truth? Briefly describe how comparisons were done?x) Figures 5F-G together are very confusing.First I don't expect tools like LinkedSV to have high recall (around 1.0) and low precision. Also, figure G is kind of subset of figure F, but results are completely different. Also use explicit notation. E.g. 50-1kbp and 1-10kbp mean completely different thingsx) We curated one benchmarking dataset and two real datasets to demonstrate the 307 performance of LRTK - what do you mean by "curation" herex) Why don't you use Tell-Seq barcode whitelist mentioned here - https://sagescience.com/wpcontent/uploads/2020/10/TELL-Seq-Software-Roadmap-User-Guide-2.pdfx) Tiered alignment approach is vaguely introduced. It is not clear what "n% most closely covered windows." mean, or how do we select a subset of reference genomes for the second phase
2. GigaScience 01 Jul 2024
  
  in GigaScience
  
  benchmarking and three real linked-read data sets from both the human genome and metagenome. We showcase LRTK’s ability to generate comparative performance results from the preceding benchmark study and to report these results in publication-ready HTML document plots. LRTK provides comprehensive and flexible modules along with an easy-to-use
  
  Reviewer 2: Lauren Mak Summary: This manuscript describes the need for a generalized linked-read (LR) analysis package and showcases the package the authors developed to address this need. Overall, the workflow is welldesigned but there are major gaps in the benchmarking, analysis, and documentation process that need to be addressed before publication.Documentation:The purpose of multiple tool options: While the analysis package is technically sound, one major aspect is left unexplained- why are there so many algorithm options included without guidance as to which one to use? There are clearly performance differences by different algorithms (combinations of 2+ not considered either) on different types of LR sequence.Provenance of ATCC-MSA-1003: Nowhere in the manuscript is the biological and technical composition of the metagenomics control described. It would be helpful to mention that this is specifically a mock gut microbiome sample, as well as the relative abundances of the originating species as well as the absolute amounts of genetic material per species (ex. as measured by genomic coverage) in the actual dataset. As a corollary, there should be standard deviations in any figures that display a summary statistic (ex. Figure 3A- precision, recall, etc.) that seems to be averaged across the species in a sample. This includes Figure 3A and Figure 4A.Dataset details: There is no table indicating the number of reads for each dataset, which would be helpful in interpreting Figures 3 and 4.Open source?: However, there was no Github link provided, only a link to the Conda landing page. Are there thorough instructions provided for the package's installation, input, output, and environment management?Benchmarking:The lack of simulated tests: The above concern (expected performance on idealized datasets) is best addressed with simulated data, which was not done despite the fact that LRSim exists (and apparently the authors have written a tool for stLFR as well previously).Indels: What are the sizes of the indels detected? Why were newer tools, such as PopIns2, Pamir, or Novel-X not tried as well?Analysis:Lines 166-169: Figure 1 panel A1 vs. B1- why do the distribution of estimated fragment sizes from the 10x datasets look so different in metagenomic vs. human samplees, when there is reasonable consistency in TELL-Seq and stLFR datasets?Lines 182-184: Figure 3A- why is LRTK's taxonomic classification quality generally lower than the of the tools? At least in terms of recall, it should perform better as mapping reads to reference genomes should have a lower false negative rate than k-mer-based tools. Also, what is the threshold for having detect a taxon? Is it just any number of reads or is there a minimum bound?Lines 187-188: Figure 3B- at least 15% of each caller's set of variants is unique to the variant, while a maximum of 50% is universal. I'd not interpret that as consistency.Lines 192-193: Are you referring to allelic imbalance as it is popularly used to refer to expression variation between the two haplotypes of a diploid organism? This clearly doesn't apply in the case of bacteria. If this is not what you're referring to, please define and/or cite the applicable definition.Lines 201-208: It's odd that despite the 10x datasets having the largest estimated fragment size, they have some of the smallest genome fractions, NGA50, and NA50. Why is this? Are they just smaller datasets, on average?Miscellaneous:UHGG: Please mention the fact that the UHGG is the default database, as well as whether or not the user will be able to supply their own databases.Line 363: What does {M} refer to?Line 369: What does U mean here? Is this the number of uniquely aligned reads in one of the windows N that a multi-aligned read aligns to?Lines 371-372: What does 'n% most closely covered windows' refer to?Lines 399-405: How are SNVs chosen for MAI analysis from the three available SNV callers?Lines 653-656: Which dataset was used for quality evaluation?Line 665: What do the abbreviations BAF and T stand for?
3. GigaScience 01 Jul 2024
  
  in GigaScience
  
  Linked-read sequencing technologies generate high base quality reads that contain extrapolative information on long-range DNA connectedness. These advantages of linked-read technologies are well known and has been demonstrated in many human genomic and metagenomic studies. However, existing linked-read analysis pipelines (e.g., Long Ranger) were primarily developed to process sequencing data from the human genome and are not suited for analyzing metagenomic sequencing data. Moreover, linked-read analysis pipelines are typically limited to one specific sequencing platform. To address these limitations, we present the Linked-Read ToolKit (LRTK), a unified and versatile toolkit for platform agnostic processing of linked-read sequencing data from both human genomes and metagenomes. LRTK provides functions to perform linked-read simulation, barcode error correction, read cloud assembly, barcode-aware read alignment, reconstruction of long DNA fragments, taxonomic classification and quantification, as well as barcode-assisted genomic variant calling and phasing. LRTK has the ability to process multiple samples automatically, and provides the user with the option to generate reproducible reports during processing of raw sequencing data and at multiple checkpoints throughout downstream analysis. We applied LRTK on two
  
  Reviewer 1: Brock Peters Yang et al. describe a package of tools, LRTK, for cobarcoded reads (linked reads) agnostic of library preparation methods and sequencing platforms. In general, it appears to be a very useful tool. I have a few concerns with the manuscript as it is currently written:1. Line 203 "With Pangaea,LRTK achieves NA50 values of 1.8 Mb and 1.2 Mb for stLFR and TELL-Seq sequencing data, respectively. On 10x Genomics sequencing data, Athena exhibited superior assembly performance, with a NGA50 of 245 Kb."This is a bit of an awkward two sentences as you are comparing NA50 values for stLFR and TELL-Seq and then NGA50 for 10X Genomics and it makes it sound like 10X Genomics performed the best. Also, these numbers don't seem to agree with the figure.2. How long does an average run take to process? Say a 35X human genome coverage sample? Are there requirements for memory? A figure and metrics around this sort of thing would be helpful.3. How much data was used per library? What was the total coverage? Was the data normalized to have the same coverage per library? If not, it's very difficult to make fair comparisons between the different technologies.4. There's a section on reconstruction of long fragments, but then there really isn't any evaluation of this result and it's not clear if these are even used for anything. For all of these sequencing types I would assume that you can't really do much in the way of seed extension since the coverage across long fragments for these methods is much less than 1X. I think this needs to be developed a little more or it needs to be explained how these are used in your process or you just need to say you didn't use them for anything but here's some potential applications they could be used for. What type of file is output from this process? I think it's interesting, but just not clear how to use this data.5. I did try to install the software using Conda, but it failed and it's not clear to me why. Perhaps it's something about my environment, but you might want to have some colleagues located in different institutions try to install it to make sure it is easy to do so.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.08.10.503458v2
www.biorxiv.org www.biorxiv.org

Genomic decoding of Theobroma grandiflorum (cupuassu) at chromosomal scale: Evolutionary insights for horticultural innovation

2
1. GigaScience 01 Jul 2024
  
  in GigaScience
  
  Results The cupuassu genome spans 423 Mb, encodes 31,381 genes distributed in the ten chromosomes, and it exhibits approximately 65% gene synteny with the T. cacao genome, reflecting a conserved evolutionary history, albeit punctuated with unique genomic variations. The main changes are pronounced by bursts of long-terminal repeats retrotransposons expansion at post-species divergence, retrocopied and singleton genes, and gene families displaying distinctive patterns of expansion and contraction. Furthermore, positively selected genes are evident, particularly among retained and dispersed, tandem and proximal duplicated genes associated to general fruit and seed traits and defense mechanisms, supporting the hypothesis of potential episodes of subfunctionalization and neofunctionalization following duplication, and impact from distinct domestication process. These genomic variations may underpin the differences observed in fruit and seed morphology, ripening, and disease resistance between cupuassu and the other Malvaceae species.
  
  Reviewer 2: Jian-Feng Mao Rafael et al. contributed their study, "Genomic decoding of Theobroma grandiflorum (cupuassu) at chromosomal scale:Evolutionary insights for horticultural innovation". In this study, high-quality genome assembly for an important plant was generated and the authors further investigated genome characterization, genome evolution, gene families etc. The data quality is high, though some points need to be clarified. And the reported data and investigations could provide valuable inference for following studies.This paper is generally well-prepared.Major comments:1. Quality control of genome assembly. The quality of genome assembly could be better evaluated with more stringent parameters. On assembly quality control, I will recommend to always follow criteria established in Earth Biogenome Project (Report on Assembly Standards, https://www.earthbiogenome.org/assembly-standards). Please evaluate the present assemblies with the criteria from EBP project, I think, on at least some if not all the items. At least, I think Merqury results would be very informative.Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02134-92. Gaps in each pseudo-chromosome. Not clear if gaps are still remained, or the genome is of gap-free?3. Centromere region. How centromeres were identified? Centromeres were shown, but no description on how you did identify them. Given the high quality of genome assembly, it would be very interesting to incorporate the investigation into distribution of centromeres. A pipeline (https://github.com/ShuaiNIEgithub/Centromics (identifying centromere with multi-oimcs data, such as repeat profiling, and Hi-C chromatin contact) is helpful, and it was generally described at https://academic.oup.com/hr/article/10/1/uhac241/6775201?login=true) has already been prepared and widely applied in data analyses in some just published T2T assemblies.
2. GigaScience 01 Jul 2024
  
  in GigaScience
  
  Background Theobroma grandiflorum (Malvaceae), known as cupuassu, is a tree indigenous to the Amazon Basin, valued for its large fruits and seed-pulp, contributing notably to the Amazonian bioeconomy. The seed-pulp is utilized in desserts and beverages, and its seed butter is used in cosmetics. Here, we present the sequenced telomere-to-telomere cupuassu genome, disclosing features of the genomic structure, evolution, and phylogenetic relationships within the Malvaceae.
  
  Reviewer 1: Xupo Ding 1. The Line or page number should be added in the revised manuscript, it is hard to point the comment to definite line.2. The methods and parameters of TE analysis should be detailed in the main text or supplementary file, especially for the LAI calculation, the LAI output by our pipeline is 11.47 and the pipeline was built according to default parameters of LTR_retiever (https://github.com/oushujun/LTR_retriever).3. What was the mutation rate (r) used for TE insert time calculation? If the insertion time were from the original files of EDTA, please notice that the default r is 1.3e-8 of grass family once --u was not set with promoting EDTA, that should be converted with the correct r value.4. Generally, the Gypsy content was usually more than Copia content in plant genome, please check it. If it were correct, please infer the reason.5. All results of GO enrichment were better enriched with KEGG.6. The results about enrichment were wrote hastily, lots of GO function or GO numbers were just list, the details should be abundant. Cite the Figures or tables or references in these sections.7. In Figure 1C, the Ks distribution need corrective, the authors can refer the polyploidization of durian genome published on Plant physiology in 2019.8. In Figure 2C, why some orders of TE loss the SD?9. In Figure 3A, T. grandiflorum and T. cacao present highly syntenic at gene level, the software of Liftoff might detect extra genes to T. grandiflorum genome based on the T. cacao genome. This is just a suggestion.10. In Figure 5A, there were 282 special genes in T. grandiflorum, please enrichment them with GO and KEGG.11. Figure 5B and D were from the GO enrichment the GO numbers should be added around annotation or list them in the supplementary files.12. In Figure 5C, the confidence interval of divergence time should be added.13. In the data availability, the weblink is not for everyone, GigaDB will record your data, so the unopened weblink might not necessary.14. In the MS, disease resistance were mentioned repeatedly, the GO enrichment has been provided some evidence, it will be better to perform the KEGG analysis with the special genes and expanded or contracted genes to verify, especially stat the changes in the ko04626.15. The language must be improved and modified by naive academic English speaker.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.12.23.573069v1
www.biorxiv.org www.biorxiv.org

Hecatomb: An End-to-End Research Platform for Viral Metagenomics

2
1. GigaScience 01 Jul 2024
  
  in GigaScience
  
  Results Here we introduce Hecatomb, a bioinformatics platform enabling both read and contig based analysis. Hecatomb integrates query information from both amino acid and nucleotide reference sequence databases. Hecatomb integrates data collected throughout the workflow enabling analyst driven virome analysis and discovery. Hecatomb is available on GitHub at https://github.com/shandley/hecatomb.
  
  Reviewer 2: Satoshi Hiraoka In this manuscript, the authors developed a novel pipeline, Hecatomb, for viral genome analysis using metagenome and virome data that accepted both short- and long-read sequencing data. Using the pipeline, the authors performed the analysis using one virome and one metagenome dataset from different environments (stool and coral reef, respectively). The analyses showed reasonable results according to the original studies and rather they discovered candidate novel phages and new findings that possibly have great insight into the microbial ecology. The manuscript is overall informative and well-written. The Hecatomb incorporates famous bioinformatics tools that are frequently used in viral genome analyses today, allowing many researchers including beginners to examine virome datasets easily and effectively. Thus the pipeline is likely valuable and would contribute to wide studies of viruses, most of which are not cultured and its characteristics are unknown. Noteworthy, there is an informative document page ( https://hecatomb.readthedocs.io/en/latest/ ) including tutorials, which are very helpful for many users. I think this point could be more emphasized in the manuscript. However, unfortunately, lacking the analysis of the mock dataset makes it hard to estimate the accuracy of the pipeline. I think adding such kinds of analysis for evaluating the performance would greatly improve the study.I have some suggestions that would increase the clarity and impact of this manuscript if addressed.Major:In general, to clearly evaluate the efficiency of the novel bioinformatic tools and pipelines, benchmarking using ground-truth datasets is important in advance to the application using real datasets. To reach this, in this case, some artificial datasets that are composed of known viral and prokaryotic genomes with defined composition and library types (single and paired-end) and sequenced read length (current short- and long-reads) could be designed as mock metagenome data. Via the analysis using the mock datasets, the accuracy of the pipeline can be evaluated. It would be appreciated if the author performed such benchmarking tests as well as the real data applications.According to the GitHub page, the Hecatomb is designed to generate results that reduce false-positive and enrich for true-positive viral read detection. This point is important for understanding the purpose of developing the pipeline and differentiating the pipeline tool from other ones. The efficiency of the false-positive reduction using this pipeline would be better clearly shown in this manuscript. Therefore the mock dataset analyses are expected.When I read the manuscript, I was confused about what the targeted dataset the pipeline aiming for. Is the Hecatomb designed to analyze common prokaryotic shotgun metagenomic data to detect viruses? In other words, is the pipeline not limited to analyzing viral metagenomes (viromes), which specifically enriched viral particles from the samples for sequencing (e.g., density centrifugation to condense viral particles)? The stool samples were likely virome datasets (viral particles were enriched via 0.45-Î¼m-pore-size membrane filtration according to the article), whereas the coral reef data are metagenome datasets. I would suggest that the terms "viral metagenome" (or virome, specifically targeting only viruses) and common "metagenome" (mainly focusing on prokaryotes) should be clearly distinguished throughout the manuscript including the title.I'm wondering about the sequence clustering step in Module 1. In my understanding, from the metagenomic settings, genomic regions are randomly sequenced, and thus most of the sequenced reads will not be clustered together using the criteria as described in the manuscript, and not so many sequences are reduced in this step. Is this step truly needed? Please add more explanation and importance about this step. For example, how many ratios of the reads were reduced in the test of the two real datasets (stool and coral reef) in this step?Minor:The introduction section is informative but a bit long. The section could be shortened.Some viruses were newly found using the pipeline (e.g., Fig1A). Which one is which virus types (dsDNA, ssDNA, dsRNA, ssRNA)? This information would be better to show clearly in the figure.I think the sequences derived from RNA viruses are generally not abundantly included in typical metagenomics datasets except if with specific techniques in the experiment. I think the potential for detecting RNA viruses from typical metagenomic DNA sequencing reads will be discussed in the Introduction section.L103. Please describe where the name "Hecatomb" is derived from in this article, though this is shown on the GitHub page.L119. " round A/B libraries" here, but I have not heard or could not find this term in the articles cited here. Please add more explanation of what is "round A/B libraries".L130 up to 2 insertions and deletions?L131. BBmap included in BBtools [73]?L181. A brief explanation of the "Baltimore classification" here would improve the readability for readers who are not familiar with this.L239. There is no explanation of what "SIV" means before.L253-L268 & Figure 4B. According to Figure 2A, there are two paths (1,2,5: aa and 1,3,4,5: nt) for detecting viral reads. I'm interested in which path is major and which is minor. Could the authors provide the ratio of the reads that predicted using aa or nt in each dataset examination (each stool and coral)?L431, L436. Not only BioProject but SRA accession ID should be provided.L479. There is no LACC here. What is his main contribution? Just reviewing and editing the manuscript is insufficient for citing as an author: see https://www.icmje.org/recommendations/browse/roles-and-responsibilities/defining-the-role-ofauthors-and-contributors.html#twoFigure 1. There are some DBs newly created and used in the pipeline (e.g., Viral AA DB, Multi-kingdom AA DB, Virus NT DB, and Polymicrobial MT DB). I think it would be better to add how to make the DBs in this or other figures. This must contribute to understanding how to construct the DBs and why to use them in this pipeline.Figure 1. specified (1)-(4) in the legend, not just color.Figure 4A. Please provide the total number of sequencing reads in addition to the read count assigned to each virus.Figure 4C. CPM was not explained in the manuscript and not listed in L460.L490. Some references are incomplete. e.g., lack of article ID or page number (49, 79, 90, 94, 95, 96, 100, 101, 102), remaining unnecessary words ("academic.oup.com" in 90, 91), etc. Please check the reference list carefully.Figure S5. Alignment length (bp)Table S2. For calculating the best hit identify, what database was used?
2. GigaScience 01 Jul 2024
  
  in GigaScience
  
  Background Analysis of viral diversity using modern sequencing technologies offers extraordinary opportunities for discovery. However, these analyses present a number of bioinformatic challenges due to viral genetic diversity and virome complexity. Due to the lack of conserved marker sequences, metagenomic detection of viral sequences requires a non-targeted, random (shotgun) approach. Annotation and enumeration of viral sequences relies on rigorous quality control and effective search strategies against appropriate reference databases. Virome analysis also benefits from the analysis of both individual metagenomic sequences as well as assembled contigs. Combined, virome analysis results in large amounts of data requiring sophisticated visualization and statistical tools.
  
  Reviewer1: Arvind Varsani The MS titled "Hecatomb: An Integrated Software Platform for Viral Metagenomics" addresses the developed of a toolkit for viral meatgenomics analysis that assembles a variety of tools into a workflow.Overall, I do not have any issue with this MS or the toolkit.I have some minor points to help improve the MS and make it as current as possible.1. Line 40: I would include Cenote-take 2 PMID: 33505708, geNomad https://www.biorxiv.org/content/10.1101/2023.03.05.531206v12. Line 40: I would probably not cite the preprint of this current paper - see ref 21.3. Line 80: Actually Cenote-take (both version 1 and 2) both use HHMs and as far as I know so does geNomad.4. Line 248: Please note that Siphoviridae, Podoviridae and Myoviridae are not currently family names. See PMID: 366830755. This means you will likely need to edit you figure to collapse these to Caudovirales6. Line 250-251: Picornaviridae and Adenoviridiae should be in italics7. Line 270: Here and elsewhere, please note that a taxa do not infect a host, it is a virus that infects a host. "Mimiviridae, that infect Acanthamoeba, and Phycodnaviridae, that infect algae, are both dsDNA viruses with large genomes" should ideally be written as "Viruses in the family Mimiviridae infect Acanthamoeba and those in the family Phycodnavirida infect algae, are dsDNA viruses with large genomes."8. Figure 6: the name tags of the CDS/ ORFS are truncated e.g. replication initiate…, heat maturation prot…9. Figure 6: Major head protein should be major capsid protein.10. One thing that I would highlight is that none of the workflows / tool kits developed account for spliced CDS. This is a major issue in automation of virus genome annotation at the moment and with this there will be some degree of misidentification for taxa assignment.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.05.15.492003v2
www.biorxiv.org www.biorxiv.org

ntsm: an alignment-free, ultra low coverage, sequencing technology agnostic, intraspecies sample comparison tool for sample swap detection

2
1. GigaScience 01 Jul 2024
  
  in GigaScience
  
  Findings The similarity between samples can be determined using indexed k-mer sequence variants. To increase statistical power, we use coverage information on variant sites, calculating similarity using a likelihood ratio-based test. Per sample error rate, and coverage bias (i.e. missing sites) can also be estimated with this information, which can be used to determine if a spatially indexed PCA-based pre-screening method can be used, which can greatly speed up analysis by preventing exhaustive all-to-all comparisons.
  
  Reviewer2: Qian Zhou In this paper, the authors have presented a tool, ntsm, which utilizes the k-mer distribution information directly from raw sequencing data for sample swap detection. The approach of bypassing the reference genome alignment step and saving computational resources is commendable. Utilizing k-mers for reference-free and de novo analysis of sequencing data is a valuable application. The authors have demonstrated the impressive performance of ntsm on low coverage data through experimental results presented in the manuscript, showcasing its strengths in terms of sensitivity, accuracy. However, while ntsm eliminates the need for reference genome alignment, it still relies on a pre-defined set of variant sites and pre-built PCA rotation matrices. This raises doubts about the true reference-free nature of ntsm and raises concerns about its generalizability to other species.Major comments:1.The concept of reference-free:I believe that ntsm's approach is not truly reference-free. In order to use ntsm, it requires the use of existing high-quality population SNP sites and kmers from the human reference genome. Additionally, the population PCA results are used to assist in pairwise comparisons between samples. Both of these information can only be obtained when a reference genome is available. A true referencefree tool would be applicable to species without a reference genome, such as SPLASH (Chaung et al., 2023, Cell). ntsm can be considered as an alignment-free or kmer-based tool.2.The reduction of computational costs:NTSM differs from Somalier in its computational workflow. To compare the computational costs or time, a holistic end-to-end comparison is necessary, rather than timing individual steps such as kmer counting and sample pairwise comparison separately. Conducting an end-to-end comparison for an analysis task allows users to have a comprehensive understanding of the tool's time and cost consumption. Furthermore, when comparing software, it is important to allocate computational resources fairly. For example, ntsm utilizes 16 threads in the 'Sample comparison process' stage, while for the 'k-mer counting (ntsm) vs. alignment (somalier)' stage, tools like bwa and minimap2, which can utilize multiple threads, were run using a single thread.3.Sensitivity and Specificity:More experimental details are needed. In the section 'Sensitivity and Specificity of Sample Swaps,' were the results obtained using the 39 HPRC samples? Did it include their Hi-C data?For Fig 6, did the results come from all sequencing datasets of the 39 samples, including Illumina and ONT? Since the results was obtained using full coverage, would the threshold change at lower coverage?For Fig 7, which demonstrates ntsm's results, was PCA information used as an auxiliary? Does the use of PCA information impact Sensitivity and Specificity?4.Regarding PCA-based method:The 39 HPRC samples used in the study are actually part of the 3,202 samples from the 1000 Genomes Project. Therefore, it is important to clarify whether the PCA matrix used in the study already includes information from these 39 samples. From a rigorous experimental design perspective, a precomputed PCA matrix should not include information from the 39 samples. Otherwise, the effect of the PCA matrix on these 39 samples may be overestimated. It raises questions about whether the same results can be achieved on non-1000 Genomes Project samples.5.The applicability of the tool:In order to expand the applicability of ntsm to a wider range of species, two aspects need to be addressed:1). Provide detailed information on customizing the sites file. From the site files available in ntsm code repository on GitHub, the process of selecting variant sites seems to be more complex than what is described in the manuscript, involving more than just SNP variants.2). The sites and PCA files should be user-customizable inputs instead of being built-in. This limitation restricts the application of ntsm to other species.Minor comments:The manuscript appears to have been hastily written and requires further polish by the authors.1. In Figure 6, A and B seem to be labeled incorrectly.2. In Figure 9, the two subplots have different y-axes, one labeled "min" and the other labeled "s." Could you clarify what each subplot is illustrating?3. When mentioning HPRC for the first time, it would be helpful to provide the full name and explanation of the acronym. However, the full explanation appears in the next paragraph.4. "We then keep only purine to pyrimidine (A or T to G or C) variants, as final insurance against possible human error influencing this tool" It seems there may be a mistake or confusion in the sentence. The writer should indeed mention "A/G <-> C/T" instead of "A/T <-> G/C" to accurately describe purine to pyrimidine variants. The writer may have made an error in describing the nucleotide exchange, or it could be a typographical mistake.5. There is a typo in the formula for estimating sequencing error rate. (nm)Â·log(1-… …
2. GigaScience 01 Jul 2024
  
  in GigaScience
  
  Background Due to human error, sample swapping in large cohort studies with heterogeneous data types (e.g. mix of Oxford Nanopore, Pacific Bioscience, Illumina data, etc.) remains a common issue plaguing large-scale studies. At present, all sample swapping detection methods require costly and unnecessary (e.g. if data is only used for genome assembly) alignment, positional sorting, and indexing of the data in order to compare similarly. As studies include more samples and new sequencing data types, robust quality control tools will become increasingly important.
  
  Reviewer1: Jianxin Wang In this manuscript, authors present a fast intra-species sample swap detecting tool, named ntsm. By counting the relevant variant k-mers from samples, it estimates the probability of each allele at sites and then uses the likelihood ratio test to detect sample swaps. Compared with the alignment-based method, Somalier, nsam performs better on low coverage data (â‰¤5X) and is more efficient in terms of memory and computing time. The authors use PCA-based spatial index heuristic to reduce the number of sample comparisons. Of course, in my opinion, compared with the time spent on counting k-mer, the time saved by the PCA-based method is trivial. In addition, ntsm also provides other features such as error rate estimation. The tool requires population snp information, which limits its applications in practice to some extent. Overall, ntsm is a fast and practical tool for calculating intra-species sample similarity and detecting sample swaps. The writing and experiments in this paper are generally well done. There are some major and minor issues that I suggest the authors consider addressing.Major issues:The paper mentions that due to high error rates, nanopore data is difficult to analyze. Can the authors analyze the performance of ntsm under different error rate data? In general, alignment-based methods may perform better on high error rate data. This is very useful information for users to choose the tool.The authors use the PCA-based spatial index heuristic to reduce the number of pairwise comparisons. However, the relation between PCA distance and similarity score is not clear here. How to ensure that samples with similarity scores less than the threshold are within the search radius?The paper involves two metrics, say, similarity score and relatedness, to detect sample swaps. Can the authors analyze the relation between them to help readers understand the advantages and disadvantages of the two methods?Minor issues:In the "Conlusions" section, the second "useful" in the sentence "this method provides other useful information useful in QC" is redundant."R=1, p<2.2e-16" in Figure 3 is not explained.In the "Sequencing error rate estimation" section, the variable n is not explained.In Figure 9, the case of the first letter of two y-axis labels (time) is inconsistent.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.11.01.565041v1
May 2024
www.biorxiv.org www.biorxiv.org

Chromosomal-level genome assembly of golden birdwing Troides aeacus (Felder & Felder, 1860)

2
1. GigaScience 16 May 2024
  
  in GigaByte
  
  Editors Assessment:
  
  This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example presents the genome of the golden birdwing butterfly Troides aeacus (Lepidoptera, Papilionidae). A notable and popular species in Asia that faces habitat loss due to urbanization and human activities. The lack of genomic resources impedes conservation efforts based on genetic markers, as well as better understanding of its biology. Using PacBio HiFi long reads and Omni-C a 351Mb genome was assembled genome anchored to 30 pseudo-molecules. After reviewers requested more information on the genome quality it seems there was high sequence continuity with contig length N50 = 11.67 Mb and L50 = 14, and scaffold length N50 = 12.2 Mb and L50 = 13. Allowing a total of 24,946 protein-coding genes were predicted. This study presents the first chromosomal-level genome assembly of the golden birdwing T. aeacus, a potentially useful resource for further phylogenomic studies of birdwing butterfly species in terms of species diversification and conservation. This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 16 May 2024
  
  in GigaByte
  
  AbstractTroides aeacus, the golden birdwing (Lepidoptera, Papilionidae) is a large swallowtail butterfly widely distributed in Asia. Despite its occurrence, T. aeacus has been assigned as a major protective species in many places given the loss of their native habitats under urbanisation and anthropogenic activities. Nevertheless, the lack of its genomic resources hinders our understanding of their biology, diversity, as well as carrying out conservation measures based on genetic information or markers. Here, we report the first chromosomal-level genome assembly of T. aeacus using a combination of PacBio SMRT and Omni-C scaffolding technologies. The assembled genome (351 Mb) contains 98.94% of the sequences anchored to 30 pseudo-molecules. The genome assembly also has high sequence continuity with scaffold length N50 = 12.2 Mb. A total of 28,749 protein-coding genes were predicted, and high BUSCO score completeness (98.9% of BUSCO metazoa_odb10 genes) was also revealed. This high-quality genome offers a new and significant resource for understanding the swallowtail butterfly biology, as well as carrying out conservation measures of this ecologically important lepidopteran species.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.122), and has published the reviews under the same license. This is part of a thematic series presenting Data Releases from the Hong Kong Biodiversity Genomics consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). These are as follows.
  
  Reviewer 1. Dr. Kumar Saurabh Singh
  
  Are the data and metadata consistent with relevant minimum information or reporting standards?
  
  No. 1. I've noticed that the genome assembly file has been uploaded to NCBI, but I couldn't locate the corresponding annotation files in GFF format. Additionally, I couldn't find gene models for Troides aeacus on NCBI or any other platform. As per Giga Science data policy, these files should be made publicly available. 2. The paper lacks information on the contig N50 and L50, although I did find this data on NCBI. Is there a specific reason for omitting the contig N50/L50 details from the main text or tables?
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  Yes. 1. I have noticed that the QV value is missing for the given assembly. To assess the base-level accuracy of your assembly, the authors should calculate the consensus quality (QV), comparing the frequency of k-mers present in the raw Omni-C reads (as you only have short-reads from Omni-c) with those present across the final assembly perhaps using Merqury. 2. Incorporating Omni-c data did not result in a significant increase in the contig N50. Have you identified any specific reasons for this outcome? 3. The overall BUSCO completeness for proteins appears to be disproportionately low (~86%) compared to genomic completeness (~98%). Could this be attributed to the absence of RNAseq data for predicting accurate gene models?
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  I believe it's essential to assess the assembly quality through comparative genomic analyses, a component seemingly missing from the manuscript. While the text mentions the availability of genomic resources within the same genus, conducting a genome-wide comparison of these assemblies could provide valuable insights into the overall synteny and contiguity of the T. aeacus assembly. To ensure annotation consistency, it's important to compare genome assemblies by generating distributions of intron/exon lengths for annotations across multiple assemblies.
  
  Reviewer 2. Dr.Xueyan Li
  
  Link to review: https://gigabyte-review.rivervalleytechnologies.comdownload-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvNDk1L0dpZ2FieXRlRFJSLTIwMjQwMS0wMS1jb21tZW50cy5kb2N4
  
  Re-review: The paper has substantially been enhanced after the first revision. I suggest that this manuscript can be published after the following minor revisions: 1.L279: ‘formosanus’ is also part of the scientific name which should be Italic type. 2.It’s recommended to beautify the figures and tables.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.01.13.575334v1
www.biorxiv.org www.biorxiv.org

Chromosome-level genome assembly of the common chiton, Liolophura japonica (Lischke, 1873)

2
1. GigaScience 15 May 2024
  
  in GigaByte
  
  Editors Assessment:
  
  This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example assembles the genome of the common chiton, Liolophura japonica (Lischke, 1873). Chitons are marine molluscs that can be found worldwide from cold waters to the tropics that play important ecological roles in the environment, but to date are lacking in genomes with only a few assemblies available. This data was produced using PacBio HiFi reads and Omni-C sequencing data, the resulting genome assembly being around 609 Mb in size. From this 28,010 protein-coding genes were predicted. After review improved the methodological details the quality metrics look near chromosome-level, having a scaffold N50 length of 37.34 Mb and 96.1% BUSCO score. This high-quality genome should hopefully be a valuable resource for gaining new insights into the environmental adaptations of L. japonica in residing the intertidal zones and for future investigations in the evolutionary biology in Polyplacophorans and other molluscs.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 15 May 2024
  
  in GigaByte
  
  AbstractChitons (Polyplacophora) are marine molluscs that can be found worldwide from cold waters to the tropics, and play important ecological roles in the environment. Nevertheless, there remains only two chiton genomes sequenced to date. The chiton Liolophura japonica (Lischke, 1873) is one of the most abundant polyplacophorans found throughout East Asia. Our PacBio HiFi reads and Omni-C sequencing data resulted in a high-quality near chromosome-level genome assembly of ∼609 Mb with a scaffold N50 length of 37.34 Mb (96.1% BUSCO). A total of 28,233 genes were predicted, including 28,010 protein-coding genes. The repeat content (27.89%) was similar to the other Chitonidae species and approximately three times lower than in the genome of the Hanleyidae chiton. The genomic resources provided in this work will help to expand our understanding of the evolution of molluscs and the ecological adaptation of chitons.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.123), and has published the reviews under the same license. This is part of a thematic series presenting Data Releases from the Hong Kong Biodiversity Genomics consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). These are as follows.
  
  Reviewer 1. Jin Sun
  
  Are all data available and do they match the descriptions in the paper?
  
  Yes. The assembly and annotations can be found in the Figshare.
  
  Is the validation suitable for this type of data?
  
  Yes. I have examined the HiC interaction map, and I think the scaffolding is high-quality.
  
  Additional Comments:
  
  The presentation is clear, but I would suggest the authors include the latest BUSCO score for the gene models.
  
  Reviewer 2. Priscila M Salloum
  
  Is the language of sufficient quality?
  
  Yes. The language is appropriate and does not hinder understanding, but some minor proof reading could benefit the manuscript. I left a few suggestions in my comments to the authors.
  
  Are all data available and do they match the descriptions in the paper?
  
  No. The data made available on NCBI has the 632 scaffolds, but the 13 pseudomolecules are not shown (in GCA_032854445.1, under Chromosomes, it reads “This scaffold-level genome assembly includes 632 scaffolds and no assembled chromosomes”), please clarify where information/data for the 13 pseudomolecules can be found. The figshare repository has the annotation files, but it lacks a metadata file detailing what each of the annotation files is (the file names are descriptive, but they do not replace a metadata file). The data availability statement lacks information about the transcriptomes (were these made available?) Supplementary tables are mentioned in the text file but were not made available (at least not for review).
  
  Are the data and metadata consistent with relevant minimum information or reporting standards?
  
  Yes. All that was provided was consistent.
  
  Is the data acquisition clear, complete and methodologically sound?
  
  No. Some clarification is needed (was the same sample used for the genome and transcriptome assembly? Were the different tissues processed in the same way? What software were used for all the bioinformatics steps? What were all the parameters and filters used for genome and transcriptome assembly and annotation?) I left specific suggestions in a file with additional comments to the authors.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  No. Software versions, citations, and parameters are missing from the methods section. Some results refer to methods not explained in the methods section.
  
  Is the validation suitable for this type of data?
  
  Yes. More details on the BlobTools parameters used are needed.
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data?
  
  No. Supplementary tables were mentioned but not provided (at least not for review). There is enough information for others to reuse the genome data, although more information in the methods section (as mentioned above) and a metadata file would make this even more useful. There is no mention of where the transcriptome has been deposited, and an extremely brief mention to how it was assembled (e.g., no details on parameters used or software versions).
  
  Additional Comments: Please include all citations in the reference list.
  
  And see additional file with comments: https://gigabyte-review.rivervalleytechnologies.com/journal/gx/download-files?YXJ0aWNsZV9pZD00OTYmZmlsZT0xOTgmdHlwZT1nZW5lcmljJnZpZXc9dHJ1ZQ==
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.01.15.575488v1
www.biorxiv.org www.biorxiv.org

Chromosomal-level genome assembly of the long-spined sea urchin Diadema setosum (Leske, 1778)

2
1. GigaScience 15 May 2024
  
  in GigaByte
  
  Editors Assessment:
  
  This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example assembles the genome of the long-spined sea urchin Diadema setosum (Leske, 1778). Using PacBio HiFi long-reads and Omni-C data the assembled genome size was 886 Mb, consistent to the size of other sea urchin genomes. The assembly anchored to 22 pseudo-molecules/chromosomes, and a total of 27,478 genes including 23,030 protein-coding genes were annotated. Peer review added more to the conclusion and future perspectives. The data hopefully providing a valuable resource and foundation for a better understanding of the ecology and evolution of sea urchins.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 15 May 2024
  
  in GigaByte
  
  AbstractThe long-spined sea urchin Diadema setosum is an algal and coral feeder widely distributed in the Indo-Pacific and can cause severe bioerosion on the reef community. Nevertheless, the lack of genomic information has hindered the study its ecology and evolution. Here, we report the chromosomal-level genome (885.8 Mb) of the long-spined sea urchin D. setosum using a combination of PacBio long-read sequencing and Omni-C scaffolding technology. The assembled genome contained scaffold N50 length of 38.3 Mb, 98.1 % of BUSCO (Geno, metazoa_odb10) genes, and with 98.6% of the sequences anchored to 22 pseudo-molecules/chromosomes. A total of 27,478 genes including 23,030 protein-coding genes were annotated. The high-quality genome of D. setosum presented here provides a significant resource for further understanding on the ecological and evolutionary studies of this coral reef associated sea urchin.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.121), and has published the reviews under the same license. This is part of a thematic series presenting Data Releases from the Hong Kong Biodiversity Genomics consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). These are as follows.
  
  Reviewer 1. Phillip Davidson
  
  Is the language of sufficient quality?
  
  Yes. Minor language errors that should be corrected in copy-editing
  
  Additional Comments:
  
  In their work, Hui et al present a chromosome-level genome assembly for Diadema setosum, the long-spined urchin. This new data is especially exciting given no high-quality genomic resource for the Diadematoida is available, bolstering comparative genomics work of echinoderms and the study of this species. Overall, the methods and data are well described and have produced a high quality genome assembly and associated annotations that will be a valuable addition to the community. I have a handful of primarily minor suggestions detailed below:
  
  Major comments:
  
  Conclusions and future perspectives: Currently, this section is only a sentence and states the new assembly will “further understanding of ecology and evolution of sea urchins”, which I think is a little uninspiring. I think more detail can be provided in this section to explain how this genome assembly adds to current knowledge. For example, reiterating that this is the first chromosome-level Diadematoida assembly, or perhaps explaining with examples how a good reference genome can inform ecological studies. Overall, the significance of this work is not really explained which I think sells this nice work short.
  
  Minor comments:
  
  Lines 232-233 state the mean coding sequence is 483 bp which seems a bit low, but having examined the peptide fasta file, I believe the average amino acid length is 483 AA, giving an average coding sequence length of ~1449bp. Please confirm and correct if necessary. This would also increase the total # of coding basepairs listed in Table 1.
  
  Lines 66-71: The authors state there are 5 chromosome-level sea urchin assemblies, all of which are camarodonts. However, I believe there are at least three additional chromosome-level assemblies for sea urchins not mentioned: 1) Echinometra sp. EZ (Ketchum et al, 2022; https://academic.oup.com/gbe/article/14/10/evac144/6717576 ) and 2) Paracentrotus lividus (Marletaz et al, 2023; https://www.sciencedirect.com/science/article/pii/S2666979X23000617?via%3Dihub ) and 3) Strongylocentrotus purpuratus (https://www.echinobase.org/echinobase/) Further, P. lividus is not a camarodont, so the text should be corrected accordingly.
  
  Lines 106: Please state whether the individual samples for genome sequencing was male or female
  
  Lines 54-54: The BUSCO score is reported at 98.1% but it should be be specified if this is the complete BUSCO score or the single-copy BUSCO score. Ideally, the single copy and duplication scores, rather than the complete, score is reported so readers have an idea for the duplication rate/haploid-ness of the assembly. Same issue on lines 221. Thank you for reporting in Table 1.
  
  Line 56: Text states “27,478 genes including 23,030 protein coding genes” were annotated. Augustus often outputs genes and transcripts, so I am wondering if the authors mean 27K transcripts including 23K genes. If so, the authors should clarify. If not, I think a brief statement of what these additional 4K genes are would be informative
  
  Table 1: Please clarify if “HiFi (X): 21” is referring to 21X coverage. Please correct length of coding sequence to amino acid sequence, and total coding sequence length. Same with Figure 1 panel B.
  
  Reviewer 2. Remi Ketchum
  
  Minor Edits
  
  Line 62: Change to “lack a vertebral column” instead of “lack the” Line 64: Change to “sea urchins” instead of sea urchin Line 70: Ketchum et al 2022 in GBE produced a chromosome-level genome assembly of Echinometra sp. EZ so this citation should be included here. Line 91: change to “results in a reduction in coral community complexity”
  
  I think that the end of the introduction could use a sentence or two that explicitly states why this genome will be a valuable resource to the scientific community. I think this will also help wrap up the introduction.
  
  Line 101: Can you provide coordinates? Also could you remove the word ‘alive.’ Line 130: I am confused by what you mean “the sample was then proceeded” Line 181: Was this the same individual that you used for genomic DNA isolation? Line 196: please could you include the specific flags that you used for purge_dups? Did you run Hifiasm with the default parameters?
  
  Line 240: I would definitely try and include some more sentences in this section. Line 253: Is this section supposed to be here? I think this is meant to go into the methods section.
  
  The authors could think about potentially a comparison table of the different urchin genome stats that are available currently? I would also encourage the readers to generate KAT plots to validate that they have successfully collapsed the haplotypes – a common problem with higher heterozygosity.
  
  Reviewer 3. F. Marlétaz
  
  I think it would be great to give further detail on the statistics out of the hifiasm contiging step. What are the contig statistics (after the hifiasm step)?
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.01.16.575490v1
www.biorxiv.org www.biorxiv.org

Genome assembly of the edible jelly fungus Dacryopinax spathularia (Dacrymycetaceae)

2
1. GigaScience 15 May 2024
  
  in GigaByte
  
  Editors Assessment:
  
  This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example presenting the first whole genome assembly of Dacryopinax spathularia, an edible mushroom-forming fungus that is used in the food industry to produce natural preservatives. Using PacBio and Omni-C data a 29.2 Mb genome was assembled, with a scaffold N50 of 1.925 Mb and 92.0% BUSCO score demonstrating the quality (review pushing the authors to provide more detail and QC stats to help better convince on this). This data providing a useful resource for further phylogenomic studies in the family Dacrymycetaceae and investigations on the biosynthesis of glycolipids with potential applications in the food industry.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 15 May 2024
  
  in GigaByte
  
  AbstractThe edible jelly fungus Dacryopinax spathularia (Dacrymycetaceae) is wood-decaying and can be commonly found worldwide. It has also been used in food additives given its ability to synthesize long-chain glycolipids. In this study, we present the genome assembly of D. spathularia using a combination of PacBio HiFi reads and Omni-C data. The genome size of D. spathularia is 29.2 Mb and in high sequence contiguity and completeness, including scaffold N50 of 1.925 Mb and 92.0% BUSCO score, respectively. A total of 11,510 protein-coding genes, and 474.7 kb repeats accounting for 1.62% of the genome, were also predicted. The D. spathularia genome assembly generated in this study provides a valuable resource for understanding their ecology such as wood decaying capability, evolutionary relationships with other fungus, as well as their unique biology and applications in the food industry.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.120), and has published the reviews under the same license. This is part of a thematic series presenting Data Releases from the Hong Kong Biodiversity Genomics consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). These are as follows.
  
  Reviewer 1. Anton Sonnenberg
  
  Is the language of sufficient quality? Yes.
  
  Are all data available and do they match the descriptions in the paper? Yes.
  
  Is the data acquisition clear, complete and methodologically sound? Yes.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes.
  
  Figure 1E could be improved by eliminating in the pie-chart the non-repeat sequences or bar-plot the repeats. That will visualize better the frequencies of each type of repeats.
  
  Reviewer 2. Riccardo Iacovelli
  
  Is the language of sufficient quality? No.
  
  There are several typos spread across the text, and some sentences are written in an unclear manner. I provide some suggestions in the attachment.
  
  Are all data available and do they match the descriptions in the paper?
  
  Yes, but some of the data shown is rather unclear and/or not supported by sufficient explanation. For example, what is actually Fig. 1C showing? Because the reference in the text (which contains a typo, line 197) refers to something else. What is the second set of stats in Fig. 1B? This other organism is not mentioned at all anywhere in the manuscript.
  
  Are the data and metadata consistent with relevant minimum information or reporting standards?
  
  No. NCBI TaxID of the sequenced species object of this work is missing.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  No. In my opinion, some of the procedures described for the processing of the sample and library prep for sequencing are reported in an unclear way. For example, lines 100-103: no details on RNAse A treatment; how do you define chloroform:IAA (24:1) washes? how much supernatant is added to how much H1 buffer to have the final volume of 6 ml? Another example, lines 180-175: what parameters did you use for EvidenceModeler to generate the final consensus genes model? The weight given to each particular prediction set is important.
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  No/ While sufficient data validation and statistical analyses have been carried out with respect to DNA sequencing and genome assembly, nothing is reported about DNA extraction and quality. The authors mention several times throughout the text that DNA preps are checked via NanoDrop, Qubit, gel electrophoresis, etc. But none of this is shown in the main body or in the supplementary information. Without this information, it is difficult to assess directly the efficacy of DNA extraction and preparation methods. I recommend including this type of data.
  
  Additional Comments:
  
  In this article, the authors report the first whole genome assembly of Dacryopinax spathularia, an edible mushroom-forming fungus that is used in the food industry to produce natural preservatives. In general, I find the data of sufficiently high quality for release, and I do agree with the authors in that it will prove useful to gain further insights into the ecology of the fungus, and to better understand the genetic basis of its ability to decay wood and produce valuable compounds. This can ultimately lead to discoveries with applications in biotech and other industries.
  
  Nevertheless, during the review process I noticed several shortcomings with respect to unclear language, insufficient description of the experimental procedures and/or results presented, and missing data altogether. These are all discussed within the checklist available in the ReView portal. For minor comments line-by-line, see below:
  
  1: Dacrymycetaceae should be italicized (throughout the whole manuscript). This follows the convention established by The International Code of Nomenclature for algae, fungi, and plants (https://www.iaptglobal.org/icn). Although not binding, this allows easy recognition of taxonomic ranks when reading an article. 49: other fungus -> other fungi 56: photodynamic injury -> UV damage/radiation (photodynamic is used with respect to light-activated therapies etc.) 60: in food industry as natural preservatives in soft drinks -> in food industry to produce natural preservatives for soft drinks 68: cultivated in industry as food additives -> cultivated in industry to produce food additives 69: isolated fungal extract -> the isolated fungal extract 71: What do you mean by Pacific? It’s unclear 71-72: the genomic resource -> genomic data/ genome sequence 72: I would remove “with translational values”, it is very vague and does not add anything to the statement 78: genomic resource -> genomic data/ genome sequence 78-81: this could be rephrased in a smoother manner: e.g. something like “the genomic data will be useful to gain a better understanding of the fungus’ ecology as well as the genetic basis of its wood-decaying ability and…” 85: fruit bodies -> fruiting bodies 88-89: Grown hyphae from >2 week-old was transferred  Fungal hyphae from 2-week old colonies were transferred 90-91: validated with the DNA barcode of Translation  assigned by DNA barcoding using the sequence of Translation… 95: ~ -> Approximately (sentences are not usually started with symbols or numbers) 101-3: Procedure is not clear enough (see other comments through ReView portal) 124: for further cleanup the library -> to further clean up the library / for further cleanup of the library 132: as line 95 152: as lines 95, 132 181-5: Insufficient description of methods, see comments through ReView portal 197: Figure and 1C; Table 2 -> Figure 1C and Table 2 200: average protein length of 451 bp -> average protein-coding gene length / average protein length of ~150 amino acids 211: via the fermentation process with applications in the food industry -> via the fermentation process with potential applications in the food industry
  
  As a fungal biologist myself interested in fungal genomics and biotechnology, I would like to thank the authors for carrying out this work and the editor for the opportunity to review it. I am looking forward to reading the revised version of the manuscript.
  
  Riccardo Iacovelli, PhD GRIP, Chemical and Pharmaceutical Biology department University of Groningen, Groningen - The Netherlands
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.01.16.575489v1
www.biorxiv.org www.biorxiv.org

Genome assembly of the milky mangrove Excoecaria agallocha

2
1. GigaScience 14 May 2024
  
  in GigaByte
  
  Editors Assessment:
  
  This work is part of a series of papers from the Hong Kong Biodiversity Genomics Consortium sequencing the rich biodiversity of species in Hong Kong. This example assembles the genome of the milky mangrove Excoecaria agallocha, also known as blind-your-eye mangrove due to its toxic properties of its milky latex that can cause blindness when it comes into contact with the eyes. Living in the brackish water of tropical mangrove forests from India to Australia, they are an extremely important habitat for a diverse variety of aquatic species, including the mangrove jewel bug of which this tree is the sole food source for the larvae. Using PacBio HiFi long-reads and Omni-C technology a 1,332.45 Mb genome was assembled, with 1,402 scaffolds and a scaffold N50 of 58.95 Mb. After feedback the annotations were improved, predicting a very high number (73,740) protein coding genes. The data presented here provides a valuable resource for further investigation in the biosynthesis of phytochemical compounds in its milky latex with the potential of many medicinal and pharmacological properties. As well as increasing the understanding of biology and evolution in genome architecture in the Euphorbiaceae family and mangrove species adapted to high levels of salinity.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 14 May 2024
  
  in GigaByte
  
  AbstractThe milky mangrove Excoecaria agallocha is a latex-secreting mangrove that are distributed in tropical and subtropical regions. While its poisonous latex is regarded as a potential source of phytochemicals for biomedical applications, the genomic resources of E. agallocha remains limited. Here, we present a chromosomal level genome of E. agallocha, assembled from the combination of PacBio long-read sequencing and Omni-C data. The resulting assembly size is 1,332.45 Mb and has high contiguity and completeness with a scaffold N50 of 58.9 Mb and a BUSCO score of 98.4 %. 73,740 protein-coding genes were also predicted. The milky mangrove genome provides a useful resource for further understanding the biosynthesis of phytochemical compounds in E. agallocha.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.119), and has published the reviews under the same license. This is part of a thematic series presenting Data Releases from the Hong Kong Biodiversity Genomics consortium (https://doi.org/10.46471/GIGABYTE_SERIES_0006). These are as follows.
  
  Reviewer 1. Minghui Kang
  
  Is the data acquisition clear, complete and methodologically sound?
  
  The sample collection site needs to include latitude and longitude data.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  Please add the software version number to all the software mentioned in the manuscript. Additionally, if the software uses default parameters, please provide the corresponding description. If specific parameters are used, please indicate the corresponding parameters
  
  Additional Comments: This study presents the assembly of an Excoecaria agallocha genome using PacBio HiFi and Omni-C technologies. The assembly exhibits good contiguity and completeness, providing a valuable resource for further understanding the phylogenetic position, evolutionary history, and natural product biosynthesis in Excoecaria agallocha. However, there are still some issues that need to be addressed and modified, including the following points: L82 It would be preferable to mention the number of chromosomes and the anchor rate of the chromosome-scale assembly here, as well as the estimated genome size based on K-mer analysis, to further support the accuracy and completeness of the assembly. L88 I think the authors need to rearrange the order of the figures, as it is not appropriate for Fig. 1F to appear before Fig. 1A. Please check the results part and arrange the pictures in a reasonable order. L117 The sample collection site needs to include latitude and longitude data. L187 Please add the software version number to all the software mentioned in the manuscript. Additionally, if the software uses default parameters, please provide the corresponding description. If specific parameters are used, please indicate the corresponding parameters. L219 The pseudochromosome scaffolding rate of 86.08% appears to be somewhat low (<90%). The sequences that were not scaffolded onto chromosomes could be a result of untrimmed redundancy in the genome assembly or could indicate some assembly errors. L220 Please note that in this instance, Fig. 1C appears before Fig. 1B in the text. I kindly request the author to review and adjust the numbering and arrangement of figures throughout the entire manuscript. L223 The quality of gene annotation appears to be significantly lower than the quality of genome assembly (82.1%/98.4%), indicating poor gene annotation accuracy. Please review the accuracy of the HMM model trained by the Augustus software or consider using a more accurate annotation workflow. L225 Unclassified repetitive sequences account for over 50% of the total repetitive sequences, which can significantly impact subsequent analyses relying on repetitive sequences. It is recommended to use alternative software, such as The Extensive de novo TE Annotator (EDTA), which provides more accurate classification and utilizes a more comprehensive repetitive sequence library, to validate these results.
  
  Reviewer 2. Dr.Jarkko Salojarvi
  
  Is the language of sufficient quality? Yes. Are all data available and do they match the descriptions in the paper? Yes Are the data and metadata consistent with relevant minimum information or reporting standards? Yes Is the data acquisition clear, complete and methodologically sound? Yes Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes Is there sufficient data validation and statistical analyses of data quality? Yes Is the validation suitable for this type of data? Yes Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.01.13.575302v1
www.biorxiv.org www.biorxiv.org

Whole genome assembly and annotation of the King Angelfish (Holacanthus passer) gives insight into the evolution of marine fishes of the Tropical Eastern Pacific

2
1. GigaScience 14 May 2024
  
  in GigaByte
  
  Editors Assessment:
  
  The King Angelfish (Holacanthus passer) is a great example of a Holacanthus angelfish that are some of the most iconic marine fishes of the Tropical Eastern Pacific. However, very limited genomic resources currently exist for the genus and these authors have assembled and annotated the nuclear genome of the species, and used it examine the demographic history of the fish. Using nanopore long reads to assemble a compact 583 Mb reference with a contig N50 of 5.7 Mb, and 97.5% BUSCOs score. Scruitinising the data, the BUSCO score was high compared to the initial N50’s, providing some useful lessons learned on how to get the most out of ONT data. The analysis suggests that the demographic history in H. passer was likely shaped by historical events associated with the closure of the Isthmus of Panama, rather than by the more recent last glacial maximum. This data provides a genomic resource to improve our understanding of the evolution of Holacanthus angelfishes, and facilitating research into local adaptation, speciation, and introgression of marine fishes. In addition, this genome can help improve the understanding of the evolutionary history and population dynamics of marine species in the Tropical Eastern Pacific.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 14 May 2024
  
  in GigaByte
  
  AbstractHolacanthus angelfishes are some of the most iconic marine fishes of the Tropical Eastern Pacific (TEP). However, very limited genomic resources currently exist for the genus. In this study we: i) assembled and annotated the nuclear genome of the King Angelfish (Holacanthus passer), and ii) examined the demographic history of H. passer in the TEP. We generated 43.8 Gb of ONT and 97.3 Gb Illumina reads representing 75X and 167X coverage, respectively. The final genome assembly size was 583 Mb with a contig N50 of 5.7 Mb, which captured 97.5% complete Actinoterygii Benchmarking Universal Single-Copy Orthologs (BUSCOs). Repetitive elements account for 5.09% of the genome, and 33,889 protein-coding genes were predicted, of which 22,984 have been functionally annotated. Our demographic model suggests that population expansions of H. passer occurred prior to the last glacial maximum (LGM) and were more likely shaped by events associated with the closure of the Isthmus of Panama. This result is surprising, given that most rapid population expansions in both freshwater and marine organisms have been reported to occur globally after the LGM. Overall, this annotated genome assembly will serve as a resource to improve our understanding of the evolution of Holacanthus angelfishes while facilitating novel research into local adaptation, speciation, and introgression in marine fishes.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.115), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1. Iria Fernandez Silva
  
  Is the language of sufficient quality? Yes. But, A "the" is missing before "clingfish" in line 171
  
  Additional Comments:
  
  The genome assembly presented is of high quality, with values of accuracy and completeness in pair with chromosome level assemblies. The study is very well presented in terms on quality of the results and clarity in the presentation of methods and results. An added value is that it allows understanding how different type of data and assemblers interact in improvng the assembly quality. I also found interesting to see how contiguity and completeness are not always correlated, as this assembly has a great completeness BUSCO score in spite of not having the greatest N50 (compared with the most modern assemblies). This is possibly inherent to the type of data (ONT reads) and this information may guide researchers in making decission over future assembly projects. The demographic analysis is a nice addition to the study, the results are coherent and add information interesting to study the evolution of reef fishes and the biogeography of the TEP. I would appeciate more detail in the captions of figure 4, particularly those of the figure 4D.
  
  Reviewer 2. Yue Song
  
  The sequencing and annotation of King Angelfish genomes is impressive and represents a significant addition to the genomic resources for marine fishes. By hybrid assembly, a high-quality genome was provided, and the relationship between historical dynamics of its population and geological events was further discussed. However, in the section on inferring the demographic history, there is no mention of how the author inferred the mutation rate of this species. In addition, the author obtained 486 contigs throughout the assembly using ONT data combined with short reads. Is it possible to further assemble these contigs into chromosomal level? Of course, this does not indicate that it must be achieved within this manuscript, but rather suggests the inclusion of additional discussion on methods to further enhance the referential value of this genome. Additional specific comments： (1) Line 86, I guess the author probably meant to say there were 486 contigs, right? (2) Line 294, "gene models", not "gen models" (3) Line 110-111, it is puzzled my about the numbers in parentheses. I don't quite understand what these numbers mean. I haven't seen any explanation in this MS. Did I miss something? (4) If possible, it is recommended to show the phylogenetic relationships between these species in Figure 3.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.11.08.566026v1
www.biorxiv.org www.biorxiv.org

De novo transcriptome assembly and genome annotation of the fat-tailed dunnart (Sminthopsis crassicaudata)

2
1. GigaScience 13 May 2024
  
  in GigaByte
  
  Editors Assessment: Marsupial species are invaluable for comparative studies due to their distinctive modes of reproduction and development, but there are a shortage of genomic resources to do these types of analyses. To help address that data gap multi-tissue transcriptomes and transcriptome assemblies have been sequenced and shared for the fat-tailed dunnart (Sminthopsis crassicaudata), a mouse-like marsupial that due to is ease of breeding is emerging as a useful lab model. Using ONT nanopore and Pacbio long-reads and illumina short reads 2,093,982 transcripts were sequenced and assembled, and functional annotation of the assembled transcripts was also carried out. Some addition work was required to provide more details on the QC metrics and access to the data but this was resolved during review. This work ultimately producing dunnart genome assembly measuring 3.23 Gb in length and organized into 1,848 scaffolds, with a scaffold N50 value of 72.64 Mb. These openly available resources hopefully provide novel insights into the unique genomic architecture of this unusual species and provide valuable tools for future comparative mammalian studies.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 13 May 2024
  
  in GigaByte
  
  AbstractMarsupials exhibit highly specialized patterns of reproduction and development, making them uniquely valuable for comparative genomics studies with their sister lineage, eutherian (also known as placental) mammals. However, marsupial genomic resources still lag far behind those of eutherian mammals, limiting our insight into mammalian diversity. Here, we present a series of novel genomic resources for the fat-tailed dunnart (Sminthopsis crassicaudata), a mouse-like marsupial that, due to its ease of husbandry and ex-utero development, is emerging as a laboratory model. To enable wider use, we have generated a multi-tissue de novo transcriptome assembly of dunnart RNA-seq reads spanning 12 tissues. This highly representative transcriptome is comprised of 2,093,982 assembled transcripts, with a mean transcript length of 830 bp. The transcriptome mammalian BUSCO completeness score of 93% is the highest amongst all other published marsupial transcriptomes. Additionally, we report an improved fat-tailed dunnart genome assembly which is 3.23 Gb long, organized into 1,848 scaffolds, with a scaffold N50 of 72.64 Mb. The genome annotation, supported by assembled transcripts and ab initio predictions, revealed 21,622 protein-coding genes. Altogether, these resources will contribute greatly towards characterizing marsupial biology and mammalian genome evolution.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.118), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1: Qiye Li
  
  For the ONT, PacBio and Illumina data for genome assembly, is there any new data that was generated in this manuscript? Are all of the data collected from the same individual? If so, what is the gender of the individual for genome assembly? It will be appreciated to make this information clear to readers. Page 3: I think "Pacific Biosciences CRL" should be modified to "Pacific Biosciences CLR"
  
  Reviewer 2. Emma Peel.
  
  Are all data available and do they match the descriptions in the paper?
  
  No. The figshare link doesn't work, but I'm presuming this is because the paper hasn't been published? Will data be accessioned in the GigaScience Database to ensure accessiblity? The illumina short-read genomic and RNAseq datasets are available through NCBI and match descriptions in the paper. I was unable to find the raw PB and ONT data from [68] that was used to generate the genome assembly. The authors of [68] indicate these datasets are available in supplementary table 3, but if you click through the figshare link in this table the raw data isn't there, nor anywhere else listed in the data availability section. Can the authors please clarify the location of the raw data and update the data availability section of this manuscript accordingly.
  
  Are the data and metadata consistent with relevant minimum information or reporting standards?
  
  Yes. Access to the GigaDB accession hasn't been provided, so I am unable to determine if the data and metadata is consistent with minimum information reporting standards according to the GigaDB checklists.
  
  Is the data acquisition clear, complete and methodologically sound?
  
  Yes. Some minor clarifications are required, see comments in the PDF. For example, please include detail on how RNA quality was determined (e.g. RIN numbers) and provide more detail regarding method of library preparation, flowcell and instrument used for Illumina sequencing.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  Yes. The only detail lacking is the method of transcript quantification used to determine the top 90% most highly expressed transcripts.
  
  Is the validation suitable for this type of data?
  
  Yes. Data validation is suitable, however I would like to see a comparison of v1.1 genome assembly with other marsupial genome assemblies.
  
  Additional Comments:
  
  This study is an important addition to marsupial omics resources, and I was excited to see such a comprehensive set of transcriptomes. My main comment is the need to explain and discuss the initial assembly (v1) in the introduction to provide context for the improved assembly. See comments in the attached PDF.
  
  Annotated paper: https://gigabyte-review.rivervalleytechnologies.comdownload-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvNDg3L2d4LURSLTE3MDE2Njk5NzdfRVAgKDIpLnBkZg==
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.11.16.567318v1
www.biorxiv.org www.biorxiv.org

IPEV: Identification of Prokaryotic and Eukaryotic Virus-derived sequences in virome using deep learning

2
1. GigaScience 13 May 2024
  
  in GigaScience
  
  AbstractBackground The virome obtained through virus-like particle enrichment contain a mixture of prokaryotic and eukaryotic virus-derived fragments. Accurate identification and classification of these elements are crucial for understanding their roles and functions in microbial communities. However, the rapid mutation rates of viral genomes pose challenges in developing high-performance tools for classification, potentially limiting downstream analyses.Findings We present IPEV, a novel method that combines trinucleotide pair relative distance and frequency with a 2D convolutional neural network for distinguishing prokaryotic and eukaryotic viruses in viromes. Cross-validation assessments of IPEV demonstrate its state-of-the-art precision, significantly improving the F1-score by approximately 22% on an independent test set compared to existing methods when query viruses share less than 30% sequence similarity with known viruses. Furthermore, IPEV outperforms other methods in terms of accuracy on most real virome samples when using sequence alignments as annotations. Notably, IPEV reduces runtime by 50 times compared to existing methods under the same computing configuration. We utilized IPEV to reanalyze longitudinal samples and found that the gut virome exhibits a higher degree of temporal stability than previously observed in persistent personal viromes, providing novel insights into the resilience of the gut virome in individuals.Conclusions IPEV is a high-performance, user-friendly tool that assists biologists in identifying and classifying prokaryotic and eukaryotic viruses within viromes. The tool is available at https://github.com/basehc/IPEV.Competing Interest StatementThe authors have declared no competing interest.FootnotesRepair the typos of the title.
  
  Reviewer 2. Mohammadali Khan Mirzaei
  
  Yin et al. have developed a new tool to differentiate eukaryotic and prokaryotic viruses. The tool offers a potential benefit to the community, but there are several issues with the contribution in its current form, as discussed below.
  
  Major issues: The authors should separate their training and testing databases. Ideally, their testing dataset should include a set of previously unseen viruses that have their host experimentally confirmed. In addition, the performance of IPEV should be compared with tools commonly used in the field, including vcontact2: https://doi.org/10.1038/s41587-019-0100-8 and iPHoP: https://doi.org/10.1371/journal.pbio.3002083. However, none of these tools are developed to directly differentiate eukaryotic and prokaryotic viruses, identification of viral taxonomy or host range could lead to the identification of viral type. Moreover, the authors have used multiple approaches for their assessment of the type of viruses. Yet, it is not clear how they combined the results they generated by these approaches in their decisions.
  
  Minor issues: Please use either phageome or phages instead of phage virome. There are some typos in the text that need to be fixed.
2. GigaScience 13 May 2024
  
  in GigaScience
  
  AbstractBackground The virome obtained through virus-like particle enrichment contain a mixture of prokaryotic and eukaryotic virus-derived fragments. Accurate identification and classification of these elements are crucial for understanding their roles and functions in microbial communities. However, the rapid mutation rates of viral genomes pose challenges in developing high-performance tools for classification, potentially limiting downstream analyses.Findings We present IPEV, a novel method that combines trinucleotide pair relative distance and frequency with a 2D convolutional neural network for distinguishing prokaryotic and eukaryotic viruses in viromes. Cross-validation assessments of IPEV demonstrate its state-of-the-art precision, significantly improving the F1-score by approximately 22% on an independent test set compared to existing methods when query viruses share less than 30% sequence similarity with known viruses. Furthermore, IPEV outperforms other methods in terms of accuracy on most real virome samples when using sequence alignments as annotations. Notably, IPEV reduces runtime by 50 times compared to existing methods under the same computing configuration. We utilized IPEV to reanalyze longitudinal samples and found that the gut virome exhibits a higher degree of temporal stability than previously observed in persistent personal viromes, providing novel insights into the resilience of the gut virome in individuals.Conclusions IPEV is a high-performance, user-friendly tool that assists biologists in identifying and classifying prokaryotic and eukaryotic viruses within viromes. The tool is available at https://github.com/basehc/IPEV.
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae018), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1: Guillermo Andres Rangel-Pineros
  
  Yin et al described the development and testing of IPEV, a deep-learning-based model that detects and discriminates sequences derived from prokaryotic and eukaryotic viruses in virome datasets. The model was developed using a set of reference viral sequences with known host information. The sequences were represented as sequence pattern matrices that contained values derived from the frequency and order of trinucleotide pairs. These matrices were subsequently used to train a 2D convolutional neural network that generates a 2-value vector for each input sequence, indicating the probability that the sequence corresponds to a prokaryotic or eukaryotic virus. The model was trained and tested using 5-fold cross validation on the reference set, and the authors assessed the robustness of the method using input datasets covering a range of homology and mutation rate values. Finally, the authors applied their model to a gut virome dataset from Shkoporov et al 2019.
  
  Indeed, IPEV represents a novel method that classifies viral sequences based on the type of host they target (prokaryotic or eukaryotic), and the results presented indicate that it efficiently covers a wide range of sequence lengths (from 100 bp). A model like IPEV provides a focus on eukaryotic viruses that is relatively shallow, in comparison with phages for which a wide range of prediction tools have been developed to date. Nevertheless, there are a few points that the authors need to address, particularly in relation to the robustness of the model:
  
  Major
  
  1) I am concerned about the number of reference sequences that were employed to train the model, and it makes me question its general applicability to viromes from any kind of biome. It would be great if the authors incorporated more sequences to their training and validation. Sources of viral sequences such as IMG/VR (https://img.jgi.doe.gov/cgi-bin/vr/main.cgi) and RVDB (https://rvdb.dbi.udel.edu/) could be useful for identifying further sequences, and generate a set that cover a much wider range of viral diversity. Perhaps, this could also lead to an improved performance for the gut datasets.
  
  2) Even though viral enrichment methods increase the concentration of viral DNA, the presence of contaminant DNA from other microbes in the enriched viral samples is common. Currently, the results do not indicate what the performance of the model would be in the presence of contaminating sequences. I suggest the authors to carry out tests that demonstrate the performance of IPEV when analysing a sample containing microbial contamination (ideally from both prokaryotes and eukaryotes) and demonstrate that IPEV is not prone to wrongly reporting these sequences as viruses.
  
  3) I find the results of the gut samples interesting and appropriate for the scope of IPEV. However, if IPEV is meant to be a general-purpose tool for virome analysis, it would be ideal if the authors provided results demonstrating the performance of the tool with samples from other biomes. For example, the authors could analyse datasets from the TARA Oceans project (e.g., 10.1016/j.cell.2019.03.040), some of which have already been assembled (https://www.ebi.ac.uk/ena/browser/view/PRJEB22493) .
  
  4) There are several instances in the manuscript where the authors indicate the existence of significant differences between metrics measured to compare the performance of tools (e.g., line 326: “which was significantly higher than the mean AUC values of …”), but there is no mention of statistical analyses conducted to reach those conclusions (except for the Wilcoxon rank-sum test in line 305). Please provide information on statistical tests conducted to identify the significant differences.
  
  Minor
  
  1) There is a reference missing in line 37. 2) In the sentence between lines 41-44, it is not clear what you are referring to with “identification of viral sequences”. Are you referring to viral vs non-viral, or to host identification? 3) Line 50: you mean “identification” or “differentiation”? 4) The two sentences between lines 49 – 52 seem redundant. I would suggest rewriting these into a single sentence. 5) Line 65: the latest version of ICTV taxonomy has 11,273 species. Please update this number. 6) Line 67: there is a newer version of VirSorter (VirSorter2), which has an expended scope in comparison with the older version. Please, modify the text to include the most up-to-date version of this tool. 7) There are some more tools with a varied range of strategies for viral prediction that are widely known among the community, which I feel should be mentioned in the introduction (e.g., VIBRANT, DeepVirFinder, PPR-Meta, etc). Even though none of these were explicitly designed for prediction of eukaryotic viruses, it’d be worth commenting on them. 8) Indicate the version of Virus-Host DB used, and the version or date when the viral data was retrieved from NCBI. 9) Line 124: do you mean 10 samples or 10 adults? If it’s the latter, please correct the sentence. 10) Line 130: by “genome sequences” are you referring to the assembled viral contigs? In that case, please clarify as it is currently ambiguous. 11) Tables 1 and 2, perhaps consider presenting these results as plots? I feel that the tables are rather hard to process. 12) Line 274: This is a rather old reference, are you sure the error rate for PacBio is still this high? I would suggest looking at more up-to-date references. 13) Line 279: replace “base insert or delete” with “insertions or deletions”. 14) Table 3: Indicate the length range of the analysed sequences in the header. 15) The section regarding the performance on functional proteins seems to include information that should be split between methods and results. Please modify accordingly. 16) Please italicise names of viral taxa wherever they are mentioned in the manuscript (e.g., Tubulavirales and Timlovirales in Line 300). 17) Line 320: This sounds as if the authors had conducted the experiments to collect the gut virome data. Rewrite to make it clear that these data were retrieved from a previous study. 18) Line 331: Based on which observation did you reach this conclusion? 19) Line 368: Wasn’t HTP developed for addressing a similar question? Please clarify. 20) Line 409-410: The way the sentence is written seems to indicate that plant viruses can also infect human cells and microorganisms. Please rewrite to make it clearer. 21) Regarding the tool’s text output, I would suggest modifying it to make it easier to parse (for example, leaving it as a tabular .csv file), and currently the header does not seem to accurately describe the contents of the file.
  
  Re-review: Yin et al described the development and testing of IPEV, a deep-learning-based model that detects and discriminates sequences derived from prokaryotic and eukaryotic viruses in virome datasets. The model was developed using a set of reference viral sequences with known host information. The sequences were represented as sequence pattern matrices that contained values derived from the frequency and order of trinucleotide pairs. These matrices were subsequently used to train a 2D convolutional neural network that generates a 2-value vector for each input sequence, indicating the probability that the sequence corresponds to a prokaryotic or eukaryotic virus. The model was trained and tested using 5-fold cross validation on the reference set, and the authors assessed the robustness of the method using input datasets covering a range of homology and mutation rate values. Finally, the authors applied their model to a gut virome dataset from Shkoporov et al 2019, and marine virome datasets from Gregory et al 2019. Indeed, IPEV represents a novel method that classifies viral sequences based on the type of host they target (prokaryotic or eukaryotic), and the results presented indicate that it efficiently covers a wide range of sequence lengths (from 100 bp). A model like IPEV provides a focus on eukaryotic viruses that is relatively shallow, in comparison with phages for which a wide range of prediction tools have been developed to date. In my opinion, the authors satisfactorily addressed the comments and suggestions made in the first round of review. I only have a few final suggestions to finalise the manuscript and have it ready for publication: 1) The authors include some text in the Discussion section (paragraph from line 423 to line 436, and paragraph from line 437 to 448) that, in my opinion, would fit better in the Results section. I suggest the authors include these in the Results section, and then in the Discussion comment how those results compare to other methods and what are their implications. 2) I would suggest modifying the sentence in line 42 like this: "Nonetheless, it is essential to note that enriched sample approaches carry the risk of losing valuable host or environmental information [8], potentially leading to inaccurate virus host identification and constraining subsequent analyses." 3) In the sentence starting in line 392, instead of "During" use "For".
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.05.28.541705v2
www.biorxiv.org www.biorxiv.org

Korea4K: whole genome sequences of 4,157 Koreans with 107 phenotypes derived from extensive health check-ups

2
1. GigaScience 13 May 2024
  
  in GigaScience
  
  AbstractWe present 4,157 whole-genome sequences (Korea4K) coupled with 107 health check-up parameters as the largest whole genomic resource of Koreans. Korea4K provides 45,537,252 variants and encompasses most of the common and rare variants in Koreans. We identified 1,356 new geno-phenotype associations which were not found by the previous Korea1K dataset. Phenomics analyses revealed 24 genetic correlations, 1,131 pleiotropic variants, and 127 causal relationships from Mendelian randomization. Moreover, the Korea4K imputation reference panel showed a superior imputation performance to Korea1K. Collectively, Korea4K provides the most extensive genomic and phenomic data resources for discovering clinically relevant novel genome-phenome associations in Koreans.Competing Interest StatementS.J., Y. J., H. R., Y.J.K., C.K, Yeonkyung K., Younghui K., Y. J. W., and B. C. K. are employees and Jong B. is the CEO of Clinomics Inc. The authors declare no other competing interests.
  
  Reviewer 2: Taras K Oleksyk, Ph.D.
  
  Comments to Author: The authors contribute 4,157 whole-genome sequences (Korea4K) coupled with 107 health check-up parameters as the largest genomic resource of the Korean Genome Project. It has likely characterized most of the common and very common genetic variants with commonly measured phenotypes for Koreans. It also discusses its applicability not only for the Korean population but also for other East Asian populations, and possibly to other national genome projects as well. This work makes a significant contribution of data that can be used in future genome-wide association studies in the context of the Korean population. The manuscript appears to cover a lot of ground: from methodological issues to the real-world applications of the dataset in healthcare. The authors adopt innovative methods like GREML, which have been reported to have higher accuracy compared to older methods. The authors are transparent about the limitations of their study, such as sample size and lack of sufficient data for rare diseases. They also acknowledge that phenomics analyses were not powerful enough for novel discoveries, indicating areas for future research. However, given the increasing importance of genomic data in healthcare and personalized medicine, the paper appears to be highly relevant. While the paper is well formulated, there are some issues that need to be addressed before is accepted for publication. See below: 1. You referred to the UK Biobank data for some of your analyses. Were there any limitations or caveats in comparing your dataset to the UK Biobank? What about other national genomic projects that are out there? How transferable do you think the Korea4K dataset would be to studies focusing on other populations outside East Asia? 2. Could you expand on any ethical considerations that were taken into account, especially in terms of data privacy and informed consent? 3. How was the data cleaned and preprocessed, and were there any missing data points? If so, how were these handled? What number of reads(before and after QC), and other quality metrics do the sequenced reads have? What was the average coverage across the genome? What was the read length? 4. How did you ensure the quality of the genomic data collected from different sources such as Korea1K and public data archives? The paper mentions mitigating batch effects through allele balance and manual checks. Could you provide more details on the methodology behind these checks and their efficiency? 5. Could you provide more information about the control group? Was it matched for age, sex, or other variables? How was the sample size determined, and does it provide enough statistical power to support your conclusions? 6. You mentioned that the statistical power of your study will increase with more participants. Would this have implications for other national genomes that are making similar projects? Please elaborate on how your sensitivity analysis could apply to other populations outside Korea. 7. The paper acknowledges the sample size as not sufficiently large for detecting weak associations, and admits that the sample size was not large enough to detect weak association signals. Have you considered statistical methods that can boost power in small samples? 8. Could you provide more details on the 107 clinical parameters used for the Korea4K phenome dataset? Were these parameters standardized across the different clinics and hospitals? 9. What criteria were used for initial sample filtering, particularly for excluding kinship? Could you clarify the steps taken to identify and filter the 64,301,272 SNVs and 8,776,608 Indels? How did you correct for batch effects arising from different Illumina NGS platforms and library preparations? Did you use specialized SNV calling software, or only GATK? 10. How were allele frequencies calculated and what considerations were made to interpret their biological significance? You mention that more than half of the singleton and doubleton variants were newly discovered. Could you elaborate on the methodology used to confirm these as novel variants? 11. The section on phenotypic correlations mentions 2,274 trait-trait relationships. How would you address the potential for population stratification affecting the results of your genetic and phenotypic correlations? How did you account for multiple comparisons in determining significant genetic correlations, and what corrections were applied to maintain the FDR? What measures were taken to ensure that the traits considered in this section were not subject to confounding and/or collider biases. 12. In your findings, Waist-Creatine showed opposite directions for genetic and phenotypic correlations. Could you elaborate on the potential implications or causes of this discrepancy? 13. Were there any other surprising or unexpected correlations, and what are their potential implications? 14. You mentioned that phenomics analyses were not powerful enough for novel discoveries. Could you elaborate more on what would be needed to make them more effective? 15. For the future implications, in terms of healthcare and personalized medicine, what do you see as the most immediate applications of the Korea4K dataset?
  
  Re-review: Thank you for providing an extensive answers to my questions. I am happy to recommend your paper to publication in its revised form.
2. GigaScience 13 May 2024
  
  in GigaScience
  
  AbstractWe present 4,157 whole-genome sequences (Korea4K) coupled with 107 health check-up parameters as the largest whole genomic resource of Koreans. Korea4K provides 45,537,252 variants and encompasses most of the common and rare variants in Koreans. We identified 1,356 new geno-phenotype associations which were not found by the previous Korea1K dataset. Phenomics analyses revealed 24 genetic correlations, 1,131 pleiotropic variants, and 127 causal relationships from Mendelian randomization. Moreover, the Korea4K imputation reference panel showed a superior imputation performance to Korea1K. Collectively, Korea4K provides the most extensive genomic and phenomic data resources for discovering clinically relevant novel genome-phenome associations in Koreans.
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae014), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1: Pui-Yan Kwok
  
  Comments to Author: This manuscript describes the second phase of the Korean Genome Project (KGP) with 4,157 sets of whole-genome data (designated Korea4K). After error correction and sequencing data curation, the whole-genome sequencing (WGS) data from 3,614 unrelated were used in the analyses. They also analyzed 107 types of clinical traits from 2,685 healthy participants' health check-up reports over a 4-year period (2016-2019). They performed a range of analyses and claimed that this new data performed better than Korea1K, the first phase KGP dataset, in a number of ways. A larger Korean dataset adds to the global genome resource and provides further insights into the Korean population. However, the results are mostly descriptive and serve as a catalog without significant new insights. The results are as expected (Korea4K is a better imputation reference panel than Korea1K, new variants are identified in the population, new variants are found in association with various phenotypes, etc.) and this dataset is sufficiently large to capture all the common variants found in the homogeneous Korean population. The authors should address several issues: 1. The use of whole genome sequencing data in GWAS. The Bonferroni correction the authors used in their analysis was that for SNP array studies. They must do a formal correction with the many more variants found in WGS data and use a statistically sound correction for their analysis. The severe penalty for multiple testing using WGS data for GWAS is why few such studies have been done. I suspect that many of the associations will not reach statistical significance after proper correction, as the dataset is quite small for most traits under study. 2. The authors should use the new genome references for their variant calling (T2T reference and the Human Pangenome Reference), as the GRCh38 is no longer the gold standard and the results will be quite different with the most up-to-date references. Using the best human genome reference will make Korea4K more valuable. 3. The authors should clarify how many of the participants who contributed clinical data are unrelated.
  
  Re-review:
  
  The authors made attempts to address the issues raised previously but did not do so adequately. 1. Using the same GWAS cutoff of P <5E-8 and adding the FDR correction (Benjamini-Hochberg) does not solve the problem of multiple testing using whole genome sequencing data (where there are orders of magnitude more variants than those on typical SNP arrays) for the study. With clinical data available for only 2,262 samples, each phenotype under study will have a very small number of individuals, making the result of 2,314 variants from 30 clinical traits with significant association highly suspect. The authors should consult statisticians with experience using whole genome sequencing data for association studies to come up with a better statistical study design. 2. The authors acknowledge that using the newer reference will be a good approach but will not do so because the "T2T reference lacks enough annotation data" is not an adequate response. The point is to have the best variant calls for the Korea4K data, annotation is irrelevant until variants with significant association are identified. Claiming that they will do so in future versions of the project diminishes the significance of the current manuscript.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.12.25.521908v1
www.biorxiv.org www.biorxiv.org

SRY: An Effective Method for Sorting Long Reads of Sex-limited Chromosome

3
1. GigaScience 13 May 2024
  
  in GigaScience
  
  AbstractMost of available reference genomes are lack of the sequence map of sex-limited chromosomes, that make the assemblies uncompleted. Recent advances on long reads sequencing and population sequencing raise the opportunity to assemble sex-limited chromosomes without the traditional complicated experimental efforts. We introduce a computational method that shows high efficiency on sorting and assembling long reads sequenced from sex-limited chromosomes. It will lead to the complete reference genomes and facilitate downstream research of sex-limited chromosomes.Competing Interest StatementThe authors have declared no competing interest.
  
  Reviewer 3. Arang Rhie
  
  Comments to Author: 1. In the introduction, add recent marker based graph phasing algorithms in long-reads, such as hifiasm trio and verkko trio mode after the T2T-Y. They are different from trio-binning, which tries to phase the reads upfront. Graph based phasing is using markers to determine haplotype specific paths to traverse. a. T2T-Y chromosome should be referencing Rhie et al., Nature 2023. Verkko is a successor of the manual efforts taken in T2T-Y, which should be also noted in the introduction. b. Reference for sexPhase program is still missing. Also, some rephrasing of the sentence is needed, as the way it is currently written is easily misleading to be understood as sexPhase was part of the methods used in the assembly of the T2T-Y. 2. There are other approaches for phasing genomes taken in plants, for example the poly ploid potato phasing using many siblings of the child by Mari et al. bioRxiv 2022.3. "But only one male and one female could suffer from sampling error" - this part is unclear. Please clarify. 4. Reference for the mason_simulator, badread software is missing. 5. Provide the accession (HG02982) for the "African human Y" in the main text. 6. I appreciate that the authors compared assemblies to T2T-Y as I requested before. However, fundamentally, mapping to T2T-Y and comparing length of each sequence classes is comparing apples to oranges, particularly in the heterochromatic region and ampliconic region of the Y. It is known to have variable copy numbers and size differences between two individuals. Frequent inversions have been reported in the ampliconic regions across different Y haplogroup. The number, size, and distribution of the repeat arrays composing the heterochromatic region has been shown to vary among different Y haplogroups in Hallast et al., Nature 2023. This can be also seen in Fig. 3c; the overall depth of the flow sorting in the heterochromatic region is below 1 - indicating the Yqh is shorter than T2T-Y, as it is in Fig. 3b. To make the benchmark legit, the authors should compare SRY and the flow sorting method using samples from the same individual. HG02982 and HX1 are presumably having very different sequence compositions given the diverged population history (African vs. Asian). Comparing total length of the assembled region against a 3rd different Y haplogroup (HG002Y) makes things more complicated, especially on regions that are known to vary a lot. If the authors think flow sorting based method needs to be compared, it should be benchmarked on the same individual to make an apple-to-apple comparison. I do agree results from read sorting (i.e. portion of reads sequenced from non-Y chromosomes in SRY vs. flow-sorting) is an important finding. However, I'd still argue comparing assemblies from the two different Y haplogroups is a stretch. The authors could have performed the same assembly length comparison on the T2T-Y using results from their SRY sorted reads with Verkko of HG002 vs. Verkko assembly using trio-binned markers. 7. In the section where assemblies are compared, the authors point to Table 1, which contains results from HG01109. HG01109 has never been mentioned before. I thought the authors were comparing assemblies from SRY sorted reads of HX1? I am not sure why the authors suddenly added a 3rd PUR genome with no context. Was this a mistake? Add results from HX1 to Table 1. 8. Please add divider lines in Table 1 between All / Ampliconic / X-degenerate / X-transposed / PAR / Het / Others. It is hard to see which rows belong to which category. 9. The last result section where authors compare results from Verkko, it is unclear how the verkko assembly was run. The authors say "default option", and later "in trio mode" in the methods. Did the authors collect parental reads from HG002 (HG003 and HG004)? How was "trio mode" performed? Did the authors used trio binning to sort the reads, then run Verkko? Or used the homopolymer compressed parental kmers and used that in the Rukki step of Verkko (and this should be benchmarked)? Was the HG002 trio assembly taken from Rautiainen et al. paper? Please clarify and add the missing parts to the main text and methods. 10. Related to the above section, it is hard to see in Fig. 4a the "two approximately 1 Mb contigs aligning to the same region of the Y chromosome". An enlarged inset of the dotplot may be helpful. Also, add legends and scale to the X and Y axis of the dotplots. 11. Note there is a mis-assembly reported on T2T-Y palindrome P5 (https://github.com/marbl/CHM13-issues/blob/main/v2.0_issues.bed), which the entire P5 should be inverted. I don't see this in the dotplots of Fig. 4. 12. In the discussion, the authors are mentioning results from the 10 trios that have been removed from the previous results. Please add the 10 trio results to the main text if it was a mistake, or remove the irrelevant results from the Discussions and Supp. Tables. 13. The authors discuss the suboptimal performance of SRY in the PAR is contributed by the restricted data types. I thought it was contributed by the lower density of the markers? The PAR parental marker density was very similar to that of autosomes, with stretches of runs of homozygosity, presumably to maintain enough homology for recombination. What was the marker density in the PAR? Was it below their 7 kmer / 1kb? 14. The authors mentioned there are no ZW genomes available to test SRY. There is a Zebra finch trio (ZW, female, bTaeGut2) and a male sample (ZZ, male, bTaeGut1) available with HiFi of the child (bTaeGut2) and Illumina of all the genomes from the Vertebrate Genomes Project (Rhie et al., Nature, 2021). Perhaps the authors could apply SRY on this individual, and compare the W chromosome results to what has been released on https://www.genomeark.org/vgp-all/Taeniopygia_guttata.html.
  
  Re-review: The authors have addressed most of my concerns. The revised manuscript reads much better than before. Regarding my last comment and response from the authors about the W chromosome, I was hoping to see comparable coverage of the W chromosome to the reference, as a proof of principle that SRY could be applied to non-human, highly diverged genomes. The assembly looks very fragmented though. Was it only the similarity to the Z chromosome that caused the fragmentation? Are there no other factors contributing to the discontinuity of the W chromosome? A few minor comments below to the revised version: 1. Please indicate which genome was compared in the legend of Supp. Table 5. 2.When using et al notations, please use the last name. Mari et al should be Serra Mari et al., Mikko et al should be Rautiainen et al. Also, Serra Mari et al is now published in Genome Biology: https://doi.org/10.1186/s13059-023-03160-z. Please update the reference. 3. There are a few grammar corrections to make.
2. GigaScience 13 May 2024
  
  in GigaScience
  
  AbstractMost of available reference genomes are lack of the sequence map of sex-limited chromosomes, that make the assemblies uncompleted. Recent advances on long reads sequencing and population sequencing raise the opportunity to assemble sex-limited chromosomes without the traditional complicated experimental efforts. We introduce a computational method that shows high efficiency on sorting and assembling long reads sequenced from sex-limited chromosomes. It will lead to the complete reference genomes and facilitate downstream research of sex-limited chromosomes.Competing Interest Statement
  
  Reviewer 2. Shilpa Garg
  
  Comments to Author: The SRY method, developed and evaluated for sorting long reads of sex-limited chromosomes, has shown promise in effectively identifying and sorting sequences based on sex-specific markers, particularly the Y chromosome. These sorted long reads are then utilized for genome assembly. Additionally, the SRY method can be used to select Y chromosome contigs from a male individual's whole genome assembly. Overall, the success of SRY in sorting and assembling long reads of sex-limited chromosomes highlights its potential as an alternative to experimental methods for studying sex-specific genomic regions. Here are some comments for further improvement of manuscript: 1) The authors may want to consider to presenting a table for standard evaluation metrics (k-mer or alignment-based). See Garg 2021 (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02328-9).2) Adding a few important genes that are medically relevant and assembled properly may further add value to the work.
3. GigaScience 13 May 2024
  
  in GigaScience
  
  AbstractMost of available reference genomes are lack of the sequence map of sex-limited chromosomes, that make the assemblies uncompleted. Recent advances on long reads sequencing and population sequencing raise the opportunity to assemble sex-limited chromosomes without the traditional complicated experimental efforts. We introduce a computational method that shows high efficiency on sorting and assembling long reads sequenced from sex-limited chromosomes. It will lead to the complete reference genomes and facilitate downstream research of sex-limited chromosomes.
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae015), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1: Zuyao Liu, Ph.D
  
  Comments to Author: The authors have introduced a novel bioinformatic approach for sex chromosome assembly, addressing a persistently challenging problem in genomics. This method harnesses the full potential of whole-genome resequencing data without necessitating supplementary experimental procedures, rendering it applicable to a wide array of non-model species. Notably, the method exhibits robustness when applied to human data, surpassing established techniques such as flow-sorting and trio-binning. While the manuscript exhibits promise, several key aspects warrant refinement and elucidation to bolster its consideration for publication in GigaScience. 1. Language Polishing: A degree of language refinement is advisable to enhance the overall clarity and professionalism of the manuscript. 2. Y Chromosome Assembly Discrepancy: The authors should acknowledge and provide an explanation for the substantial difference between the length of the latest Y chromosome assembly from T2T (~62Mb) and the assembly from SRY with Verkko (~23Mb), as detailed in Table 1. 3. Y Chromosome Completeness: In cases where the Y chromosome assembly is incomplete, the inclusion of a figure or table delineating the proportion that SRY can recover in distinct regions of the Y chromosome would be beneficial. This could facilitate a comparative analysis of the method's efficacy across different regions. 4. Figure 4 Clarity: It is imperative to label the coordinates on both the X and Y axes in Figure 4 to enhance clarity. While Figure 4 suggests that the assembly from SRY is complete compared to T2T-CHM13, the total length of the SRY assembly (approximately 23Mb) should be clearly reconciled with this observation. 5. Table 1 Organization: The organization of Table 1 should be improved to enhance readability and comprehensibility. 6. MSK-Based Read Filtering: Authors should explicitly address the potential exclusion of reads from Y regions with lower than average MSK, especially in species with both young and old parts on Y chromosomes. If possible, provide recommendations or strategies for rescuing such reads. 7. Simulation for species with young sex chromosomes: It is essential to conduct additional simulations for testing the efficiency of isolating Y reads for species with young sex chromosomes. This analysis should consider the variation between X and Y chromosomes, aiding researchers in evaluating the method's suitability for their specific study organisms.
  
  Addressing these points will further strengthen the manuscript's scientific rigor and its suitability for publication in GigaScience.
  
  Re-review: After reading the revised article, the questions I had previously posed were answered. I am very interested in this SRY method and believe it is also an important part of sex chromosome research. From my personal point of view, it is not easy to collect Trio data for most species except a few, but it is relatively easy to collect HIC data. It would be helpful if the authors could also compare the results of SRY HIFI with those of Hifiasm (HIC phased) to help people choose the right tool for sex chromosome assembly. However, this is not necessary, because SRY has achieved a very good result in humans. Overall, the data and results are convincing.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2020.05.25.115592v1
www.biorxiv.org www.biorxiv.org

Streamlining remote nanopore data access with slow5curl

3
1. GigaScience 13 May 2024
  
  in GigaScience
  
  ABSTRACTAs adoption of nanopore sequencing technology continues to advance, the need to maintain large volumes of raw current signal data for reanalysis with updated algorithms is a growing challenge. Here we introduce slow5curl, a software package designed to streamline nanopore data sharing, accessibility and reanalysis. Slow5curl allows a user to fetch a specified read or group of reads from a raw nanopore dataset stored on a remote server, such as a public data repository, without downloading the entire file. Slow5curl uses an index to quickly fetch specific reads from a large dataset in SLOW5/BLOW5 format and highly parallelised data access requests to maximise download speeds. Using all public nanopore data from the Human Pangenome Reference Consortium (>22 TB), we demonstrate how slow5curl can be used to quickly fetch and reanalyse signal reads corresponding to a set of target genes from each individual in large cohort dataset (n = 91), minimising the time, egress costs, and local storage requirements for their reanalysis. We provide slow5curl as a free, open-source package that will reduce frictions in data sharing for the nanopore community: https://github.com/BonsonW/slow5curlCompeting Interest StatementI.W.D. manages a fee-for-service sequencing facility at the Garvan Institute of Medical Research that is a customer of Oxford Nanopore Technologies but has no further financial relationship. H.G., J.M.F. and I.W.D. have previously received travel and accommodation expenses from Oxford Nanopore Technologies. The authors declare no other competing financial or non-financial interests.
  
  Reviewer 3. Guillermo Dufort y Alvarez
  
  This paper introduces slow5curl, a software tool that extends slow5lib, a previous tool developed by the authors. The tool allows users to retrieve raw nanopore sequencing reads from remote BLOW5 files, a novel format that has several advantages over FAST5 and POD5, which are the most widely used formats for storing raw nanopore sequencing data. BLOW5 is not yet a standard format, but this tool could encourage its adoption and the development of similar tools in the future. The paper is well written, clear, and concise, and the tool is tested on various scenarios. The GitHub repository provides clear instructions and examples for building and using the tool. My comments to the authors are: Major 1. I am concerned about the main use case of the tool, which is to obtain a subset of raw nanopore reads that align to a specific region (e.g., a gene), in order to re-basecall them with a new software tool. This assumes that the alignment region of the original basecall is consistent with the new basecall, which may not be true. The new basecall sequences may align better to a different region, and some sequences that were not retrieved may align well to the desired region. This affects the precision and recall of the process. I would like the authors to address this issue, by either providing evidence that this is rare, or explaining why the tool is still useful despite this limitation. 2. The tool depends on the availability of a BAM file for the raw reads, which is uploaded along with the BLOW5 file and its index. In the section Fetching reads from a large cohort, the authors claim that storing the raw nanopore data with its index reduces the size by 29.7% compared to FAST5. However, they do not consider the size of the BAM file, which is required for the main use case. I would like the authors to address this, by either reporting the size of the BAM files, or justifying why their size is irrelevant for this comparison.
  
  Minor 1. In section RESULTS, in line two, delete the repeated word simple from "simple BLOW5 simple".
  
  Re-review. The authors correctly addressed each one of the comments I made. From my side, no further changes are needed for publication. Great work.
2. GigaScience 13 May 2024
  
  in GigaScience
  
  As adoption of nanopore sequencing technology continues to advance, the need to maintain large volumes of raw current signal data for reanalysis with updated algorithms is a growing challenge. Here we introduce slow5curl, a software package designed to streamline nanopore data sharing, accessibility and reanalysis. Slow5curl allows a user to fetch a specified read or group of reads from a raw nanopore dataset stored on a remote server, such as a public data repository, without downloading the entire file. Slow5curl uses an index to quickly fetch specific reads from a large dataset in SLOW5/BLOW5 format and highly parallelised data access requests to maximise download speeds. Using all public nanopore data from the Human Pangenome Reference Consortium (>22 TB), we demonstrate how slow5curl can be used to quickly fetch and reanalyse signal reads corresponding to a set of target genes from each individual in large cohort dataset (n = 91), minimising the time, egress costs, and local storage requirements for their reanalysis. We provide slow5curl as a free, open-source package that will reduce frictions in data sharing for the nanopore community: https://github.com/BonsonW/slow5curlCompeting Interest Statement
  
  Reviewer 2. Yunfan Fan
  
  Comments to Author: In this manuscript, the authors demonstrate a highly streamlined method for downloading targeted subsets of raw ONT electrical signals, for re-analysis. In my view, this will be a highly useful tool for researchers working with public nanopore data, and I hope to see its widespread adoption. The benchmarks are well-described in the manuscript, and the code is publicly available and well-documented. I have no other notes or suggestions for the authors.
3. GigaScience 13 May 2024
  
  in GigaScience
  
  ABSTRACTAs adoption of nanopore sequencing technology continues to advance, the need to maintain large volumes of raw current signal data for reanalysis with updated algorithms is a growing challenge. Here we introduce slow5curl, a software package designed to streamline nanopore data sharing, accessibility and reanalysis. Slow5curl allows a user to fetch a specified read or group of reads from a raw nanopore dataset stored on a remote server, such as a public data repository, without downloading the entire file. Slow5curl uses an index to quickly fetch specific reads from a large dataset in SLOW5/BLOW5 format and highly parallelised data access requests to maximise download speeds. Using all public nanopore data from the Human Pangenome Reference Consortium (>22 TB), we demonstrate how slow5curl can be used to quickly fetch and reanalyse signal reads corresponding to a set of target genes from each individual in large cohort dataset (n = 91), minimising the time, egress costs, and local storage requirements for their reanalysis. We provide slow5curl as a free, open-source package that will reduce frictions in data sharing for the nanopore community: https://github.com/BonsonW/slow5curl
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae016), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1: Jan Voges
  
  Comments to Author: The manuscript provides a detailed overview of the proposed technology, with an emphasis on reproducibility through precise software version and command line documentation. Although slow5curl is a rather simple implementation of a curl-based streaming for nanopore data, it is extensively evaluated. In this way, its value to the nanopore community is made clear. I do have a few minor comments:
  
  Introduction: The importance of preserving raw signal data needs to be more clearly articulated. There is a view within the community that reads that have undergone high-accuracy base calling and methylation calling are sufficient for distribution and long-term storage. The clarification on the importance of raw data retention would strengthen the introduction.
  
  Results: Please rephrase "[…]fetch a specific read(s) […]". Results: It should be stated more explicitly that BLOW5 is a compressed data representation and therefore suitable for streaming. Results: "The simple BLOW5 simple file-structure[…]" -> "The simple BLOW5 file structure […]"
  
  Discussion: "[…] users must upload a single FAST5 tarball for a given datasets" -> "[…] users must upload a single FAST5 tarball for a given dataset" Discussion: While the SLOW5 ecosystem is described in detail, it would be beneficial to discuss whether there are any alternative solutions or technologies that provide a comparative perspective. Discussion: It would be interesting to discuss the possible standardization of the SLOW5 ecosystem. What is the vision? An academically centered open-source ecosystem? A proprietary system? A more "formal" standard (GA4GH, ISO/IEC)?
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.11.28.569128v1
www.biorxiv.org www.biorxiv.org

On the variability of dynamic functional connectivity assessment methods

2
1. GigaScience 13 May 2024
  
  in GigaScience
  
  Dynamic functional connectivity (dFC) has become an important measure for understanding brain function and as a potential biomarker. However, various methodologies have been developed for assessing dFC, and it is unclear how the choice of method affects the results. In this work, we aimed to study the results variability of commonly-used dFC methods. We implemented seven dFC assessment methods in Python and used them to analyze fMRI data of 395 subjects from the Human Connectome Project. We measured the pairwise similarity of dFC results using several similarity metrics in terms of overall, temporal, spatial, and inter-subject similarity. Our results showed a range of weak to strong similarity between the results of different methods, indicating considerable overall variability. Surprisingly, the observed variability in dFC estimates was comparable to the expected natural variation over time, emphasizing the impact of methodological choices on the results. Our findings revealed three distinct groups of methods with significant inter-group variability, each exhibiting distinct assumptions and advantages. These findings highlight the need for multi-analysis approaches to capture the full range of dFC variation. They also emphasize the importance of distinguishing neural-driven dFC variations from physiological confounds, and developing validation frameworks under a known ground truth. To facilitate such investigations, we provide an open-source Python toolbox that enables multi-analysis dFC assessment. This study sheds light on the impact of dFC assessment analytical flexibility, emphasizing the need for careful method selection and validation, and promoting the use of multi-analysis approaches to enhance reliability and interpretability of dFC studies.Competing Interest StatementThe authors have declared no competing interest.
  
  Reviewer 2. Nicolas Farrugia
  
  Comments to Author: Summary of review This paper fills a very important gap in the literature investigating time-varying functional connectivity (or dynamic functional connectivity, dFC), by measuring analytical flexibility of seven different dFC methods. An impressive amount of work has been put up to generate a set of convincing results, that essentially show that the main object of interest of dFC, which is the temporal variability of connectivity, cannot be measured with a high consistency, as this variability is of the same order of magnitude or even higher than the changes observed across different methods on the same data. In this very controversial field, it is very remarkable to note that the authors have managed to put together a set of analysis to demonstrate this in a very clear and transparent way. The paper is very well written, the overall approach is based on a few assumptions that make it possible to compare methods (e.g. subsampling of temporal aspects of some methods, spatial subsampling), and the provided analysis is very complete. The most important results are condensed in a few figures in the main manuscript, which is enough to convey the main messages. The supplementary materials provide an exhaustive set of additional results, which are shortly discussed one by one. Most importantly, the authors have provided an open source implementation of 7 main dfc methods. This is very welcome for the community and for reproductibility, and is of course particularly suited for this kind of contribution. A few suggestions follow. Clarification questions and suggestions : 1- How was the uniform downsampling of 286 ROI to 96 done ? Uniform in which sense ? According to the RSN ? Were ROIs regrouped with spatial contiguity ? I understand this was done in order to reduce computational complexity and to harmonize across methods, but the manuscript would benefit from having an added sentence to explain what was done. 2- Table A in figure 1 shows the important hyperparameters (HP) for each method, but the motivations regarding the choice of HP for each method is only explained in the discussion (end of page 11, "we adopted the hyperparameter values recommended by the original paper or consensus among the community for each method"). It would be better to explain it in the methods, and then only discuss why this can be a limitation, in the discussion. 3- The github repository https://github.com/neurodatascience/dFC/tree/main does not reference the paper 4- The github repository https://github.com/neurodatascience/dFC/tree/main is not documented enough. There are two very large added values in this repo : open implementation of methods, and analytical flexibility tools. The demo notebook shows how to use the analytical flexibility tools, but the methods implementation is not documented. I expect that many people will want to perform analysis using the methods as well as comparison analysis, so the documentation of individual methods should not be minimized. 5 - For the reader, it would be better to include early in the manuscript (in the introduction) the presence of the code for reproductibility. Currently, the toolbox is only introduced in the final paragraph of the discussion. It comes as a very nice suprise when reading the manuscript in full, but I think the manuscript would gain a lot of value if this paragraph was included earlier, and if the development of the toolbox was included much earlier (ie. in the abstract). 6 - We have published two papers on dFC that the authors may want to include, although these papers have investigated cerebello-cerebral dFC using whole brain + cerebellum parcellations. The first paper used continuous HMM on healthy subjects, and found correlations with impulsivity scores, while the second papers used network measures on sliding window dFC matrices on a clinical cohort (patients with alcohol use disorder). I am not sure why the authors have not found our papers in their litterature, but maybe it would be good to include them. Authors need to update the final table in supplementary materials as well as the citations in the main paper. Abdallah, M., Farrugia, N., Chirokoff, V., & Chanraud, S. (2020). Static and dynamic aspects of cerebro-cerebellar functional connectivity are associated with self-reported measures of impulsivity: A resting-state fMRI study. Network Neuroscience, 4(3), 891-909. Abdallah, M., Zahr, N. M., Saranathan, M., Honnorat, N., Farrugia, N., Pfefferbaum, A., Sullivan, E. & Chanraud, S. (2021). Altered cerebro-cerebellar dynamic functional connectivity in alcohol use disorder: a resting-state fMRI study. The Cerebellum, 20, 823-835. Note that in Abdallah et al. (2020), while we did not compare HMM results with other dFC methods, we did investigate the influence of HMM hyperparameters, as well as perform internal cross validation on our sample + null models of dFC.
  
  Minor comments 6 - "[..] what lies behind the of methods. Instead, they reveal three groups of methods, 720 variations in dynamic functional connectivity?. " -> an extra "." was added (end of page 10).
2. GigaScience 13 May 2024
  
  in GigaScience
  
  AbstractDynamic functional connectivity (dFC) has become an important measure for understanding brain function and as a potential biomarker. However, various methodologies have been developed for assessing dFC, and it is unclear how the choice of method affects the results. In this work, we aimed to study the results variability of commonly-used dFC methods. We implemented seven dFC assessment methods in Python and used them to analyze fMRI data of 395 subjects from the Human Connectome Project. We measured the pairwise similarity of dFC results using several similarity metrics in terms of overall, temporal, spatial, and inter-subject similarity. Our results showed a range of weak to strong similarity between the results of different methods, indicating considerable overall variability. Surprisingly, the observed variability in dFC estimates was comparable to the expected natural variation over time, emphasizing the impact of methodological choices on the results. Our findings revealed three distinct groups of methods with significant inter-group variability, each exhibiting distinct assumptions and advantages. These findings highlight the need for multi-analysis approaches to capture the full range of dFC variation. They also emphasize the importance of distinguishing neural-driven dFC variations from physiological confounds, and developing validation frameworks under a known ground truth. To facilitate such investigations, we provide an open-source Python toolbox that enables multi-analysis dFC assessment. This study sheds light on the impact of dFC assessment analytical flexibility, emphasizing the need for careful method selection and validation, and promoting the use of multi-analysis approaches to enhance reliability and interpretability of dFC studies.
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae009), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1: Yara Jo Toenders
  
  Comments to Author: The authors performed an in-depth comparison of 7 dynamic functional connectivity methods. The paper includes many figures that are greatly appreciated as they clearly demonstrate the findings. Moreover, the authors developed a Python toolbox to implement these 7 methods. The results showed that the results were highly variable, although three clusters of similar methods could be detected. However, after reading the manuscript, there are some remaining questions. - The TR and timepoints of the fMR images are shown, but other acquisition parameters such as the voxel size are missing. Could all acquisition parameters please be provided? - Could more information be provided on the downsampling of the 286 to 96 ROIs? How was this done and what were the 96 ROIs that were created? - In the results it is explained that the definition of groups depended on the cutoff value of the clustering, however it is unclear how the cutoff value was determined. Could the authors elucidate this how this was done? - The difference between the subplots in Figure 3 is a bit difficult to understand because the labels of the different methods switch places. Perhaps the same colour could be used for the cluster of the continuous HMM, Clustering and Discrete HMM method to increase readability? - Figure 4b shows that the default mode network is more variable over methods than time, while the auditory and visual are not. Could the authors explain what may underlie this discrepancy? - From the introduction it became clear that many studies have used dFC to study clinical populations, while I understand that no single recommendation can be given, not every clinical study might have the capacity to use all 7 methods. What would the authors recommend these clinical studies? Would there for example be a method that would be recommended within each of the three clusters? - It could be helpful if the authors create DOIs for their toolbox code bases that could be cited in a manuscript, rather than linking to bare GitHub URLs. One potentially useful guide is: https://guides.github.com/activities/citable-code/
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.07.13.548883v2
www.biorxiv.org www.biorxiv.org

Pangenome databases provide superior host removal and mycobacteria classification from clinical metagenomic data

2
1. GigaScience 13 May 2024
  
  in GigaScience
  
  Background Culture-free real-time sequencing of clinical metagenomic samples promises both rapid pathogen detection and antimicrobial resistance profiling. However, this approach introduces the risk of patient DNA leakage. To mitigate this risk, we need near-comprehensive removal of human DNA sequence at the point of sequencing, typically involving use of resource-constrained devices. Existing benchmarks have largely focused on use of standardised databases and largely ignored the computational requirements of depletion pipelines as well as the impact of human genome diversity.Results We benchmarked host removal pipelines on simulated Illumina and Nanopore metagenomic samples. We found that construction of a custom kraken database containing diverse human genomes results in the best balance of accuracy and computational resource usage. In addition, we benchmarked pipelines using kraken and minimap2 for taxonomic classification of Mycobacterium reads using standard and custom databases. With a database representative of the Mycobacterium genus, both tools obtained near-perfect precision and recall for classification of Mycobacterium tuberculosis. Computational efficiency of these custom databases was again superior to most standard approaches, allowing them to be executed on a laptop device.Conclusions Nanopore sequencing and a custom kraken human database with a diversity of genomes leads to superior host read removal from simulated metagenomic samples while being executable on a laptop. In addition, constructing a taxon-specific database provides excellent taxonomic read assignment while keeping runtime and memory low. We make all customised databases and pipelines freely available.Competing Interest StatementThe authors have declared no competing interest.
  
  Reviewer 2. Darrin Lemmer, M.S.
  
  Comments to Author: This paper describes a method for improving the accuracy and efficiency of extracting a pathogen of interest (M. tuberculosis in this instance, though the methods should work equally well for other pathogens) from a "clinical" metagenomic sample. The paper is well written and provides links to all source code and datasets used, which were well organized and easy to understand. The premise – that using a pangenome database improves classification -- seems pretty intuitive, but it is nice to see some benchmarking to prove it. For clarity I will arrange my comments by the three major steps of your methods: dataset generation, human read removal, and Mycobacterium read classification. 1. Dataset generation -- I appreciate that you used a real-world study (reference #8) to approximate the proportions of organisms in your sample, however I am disappointed that you generated exactly one dataset for benchmarking. Even if you use the exact same community composition, there is a level of randomness involved in generating sequencing reads, and therefore some variance. I would expect to see multiple generations and an averaging of the results in the tables, however with a sufficiently high read depth, the variance won't likely change your results much, so it would be nice, and more true to real sequencing data, to vary the number of reads generated (I didn't see where you specified to what read depth for each species you generated the reads for), as it is rare in the real world to always get this deep of coverage. Ideally it would also be nice to see datasets varying the proportions of MTBC in the sample to test the limits of detection, but that may be beyond the scope of this particular paper. 2. Human read removal -- The data provided do not really support the conclusion, as all methods benchmarked performed quite well and, particularly when using the long reads from the Nanopore simulated dataset, fairly indistinguishable with the exception of HRRT. The short Illumina reads show a little more separation between the methods, probably due to the shorter sequences being able to align to multiple sequences in the reference databases, however comparing kraken human to kraken HPRC still shows very little difference, thus not supporting the conclusion that the pangenome reference provides "superior" host removal. The run times and memory used do much more to separate the performance of the various methods, and particularly with the goal of being able to run the analysis on a personal computer where peak memory usage is important. The only methods that perform well within the memory constraints of a personal computer for both long reads and short leads are HRRT and the two kraken methods, with kraken being superior at recall, but again, kraken human and kraken HPRC are virtually indistinguishable, making it hard to justify the claim that the pangenome is superior. Also, it appears your run time and peak memory usage is again based on one single data point, these should be performed multiple times and averaged. Finally, as an aside, I did find it interesting and disturbing that HRRT had such a high false negative rate compared to the other methods, given that this is the primary method used by NCBI for publishing in the SRA database, implying there are quite a few human remaining in SRA. 3. Mycobacterium read classification -- Here we do have some pretty good support for using a pangenome reference database, particularly compared to the kraken standard databases, though as mentioned previously, a single datapoint isn't really adequate, and I'd like to see both multiple datasets and multiple runs of each method. Additionally, given the purpose here is to improve the amount of MTB extracted from a metagenomic sample, these data should be taken the one extra step to show the coverage breadth and depth of the MTB genome provided by the reads classified as MTB, as a high number of reads doesn't mean much if they are all stacked at the same region of the genome. Given that these are simulated reads, which tend to have pretty even genome coverage, this may not show much, however it is still an important piece to show the value of your recommended method. One final comment is that it should be fairly easy to take this beyond a theoretical exercise, by running some actual real world datasets through the methods you are recommending to see how well they perform in actuality. For instance, reference #8, which you used as a basis for the composition of your simulated metagenomic sample, published their actual sequenced sputum samples. It would be easy to show if you can improve the amount of Mycobacterium extracted from their samples over the methods they used, thus showing value to those lower income/high TB burden regions where whole metagenome sequencing may be the best option they have.
  
  Re-review.
  
  This is a significantly stronger paper than originally submitted. I especially appreciate that multiple runs have now been done with more than one dataset, including a "real" dataset, and the analysis showing the breadth and depth of coverage of the retained Mtb reads, proving that you can still generally get a complete genome of a metagenomic sample with these methods. However kraken's low sensitivity when using the standard database definitely impacts the results, making a stronger argument for using a pangenome database (Kraken-Standard can identify the presence of Mtb, but if you want to do anything more with it, like AMR detection, you would need to use a pangenome database). I really think that this should be emphasized more, and perhaps some or all of the data in tables S9-S12 be brought into the main paper. It is maybe worth noting, that the significant drop in breadth, I would imagine, is a result of dividing the total size of the aligned reads by the size of the genome, implying a shallow coverage, but the reality is still high coverage in the areas that are covered, but lots of complete gaps in coverage. I did also like the switch to the somewhat more standard sensitivity/specificity metrics, though I do lament the actual FN/FP counts being relegated to the supplemental tables, as I thought these numbers valuable (or at least interesting) when comparing the results of the various pipelines, particularly with human read removal, where the various pipelines perform quite similarly.
2. GigaScience 13 May 2024
  
  in GigaScience
  
  Background Culture-free real-time sequencing of clinical metagenomic samples promises both rapid pathogen detection and antimicrobial resistance profiling. However, this approach introduces the risk of patient DNA leakage. To mitigate this risk, we need near-comprehensive removal of human DNA sequence at the point of sequencing, typically involving use of resource-constrained devices. Existing benchmarks have largely focused on use of standardised databases and largely ignored the computational requirements of depletion pipelines as well as the impact of human genome diversity.Results We benchmarked host removal pipelines on simulated Illumina and Nanopore metagenomic samples. We found that construction of a custom kraken database containing diverse human genomes results in the best balance of accuracy and computational resource usage. In addition, we benchmarked pipelines using kraken and minimap2 for taxonomic classification of Mycobacterium reads using standard and custom databases. With a database representative of the Mycobacterium genus, both tools obtained near-perfect precision and recall for classification of Mycobacterium tuberculosis. Computational efficiency of these custom databases was again superior to most standard approaches, allowing them to be executed on a laptop device.Conclusions Nanopore sequencing and a custom kraken human database with a diversity of genomes leads to superior host read removal from simulated metagenomic samples while being executable on a laptop. In addition, constructing a taxon-specific database provides excellent taxonomic read assignment while keeping runtime and memory low. We make all customised databases and pipelines freely available.
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae010), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1: Ben Langmead
  
  Reviewer Comments to Author: Mycobacterium tuberculosis is a leading cause of death and is increasingly investigated using DNA sequencing of e.g. sputum. However sequencing data has to be handled carefully to remove reads from the human host and to correct classify the M. tuberculosis reads. The authors focus on this problem of computationally extracting only the M. tuberculosis reads from a simulated human sample, measuring the accuracy and time/memory resources used by different approaches. They begin by focusing on removing human reads using different alignment and k-mer-based tools. The authors discover, interestingly, that using a Kraken database over all the HPRC genomes leads to the best balance of resources and accuracy. Next, the authors focus on classifying the remaining reads with various databases, some general across the tree-of-life and some limited to M. tuberculosis. For this task, the authors identify the custom Mycobacterium databases as being the best choice to correctly identify tuberculosis efficiently. The paper is very clear and well written.
  
  Major Comments: 1. While the host-depletion and mycobacterium classification do tell us something, some of the numbers are quite small, leading me to wonder how robust the results are. The question lingers: should we really be making dedicions based on a simulation study where results are similar out to the fifth decimal point? There is definitely information here, but additionally evaluating real datasets or still-larger simulated ones could make the results more actionable. 2. In the Mycobacterium experiment, it does not seem approriate to use the reads not classified by Kraken as the inputs, since Kraken is what is being benchmarked in the first place. Given this is a simulation data, an alternative would be to use true non-human reads as the input. 3. The Discussion could be improved with some discussion of whether these approaches could generalize to other taxa, or other host/pathogen combinations. 4. In Table 3, the authors point out that minimap2 is the only tool to misclassify Mycobacterium reads as Human. Relatively, it has the most FPs compared to the other tools. Did the authors investigate where those alignments fell within CHM13v2? it's mentioned that Hostile uses minimap2 but with extra filtering, so I'm surprised it only has 4 FPs. Does that make sense given the specific filtering steps it performs after alignment? 5. I may have missed it, but the authors should characterize the error rate for the simulated ONT and Illumina reads somewhere. Saying the "default model" is used doesn't help the reader to understand the error profile. Minor Comments: 1. Kraken has substantially low recall using the standard and standard-8 databases in Tables 4 and 5. Those were the only times a tool had a recall below 97%. Is this expected? Perhaps because key strains are missing from the database? This wasn't explained. 2. The units of the peak memory in the tables are in MB, but memory thresholds are described using GB units in the paper. Consider changing tables to be in GB. 3. At the end of the host-depletion section, it's mentioned that all missed human reads (FNs) were from unplaced scaffolds. Is it known if those matches are due to contamination in the assembly? Contigs under 10 kb were filtered, but there could be contaminated contigs above that length.
  
  Re-review: The authors have substantially addressed the points I raised in my review.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.09.18.558339v3
www.biorxiv.org www.biorxiv.org

Improved integration of single cell transcriptome data demonstrated on heart failure in mice and men

2
1. GigaScience 12 May 2024
  
  in GigaScience
  
  AbstractBiomedical research frequently uses murine models to study disease mechanisms. However, the translation of these findings to human disease remains a significant challenge. In order to improve the comparability of mouse and human data, we present a cross-species integration pipeline for single-cell transcriptomic assays.The pipeline merges expression matrices and assigns clear orthologous relationships. Starting from Ensembl ortholog assignments, we allocated 82% of mouse genes to unique orthologs by using additional publicly available resources such as Uniprot, and NCBI databases. For genes with multiple matches, we employed the Needleman-Wunsch global alignment based on either amino acid or nucleotide sequence to identify the ortholog with the highest degree of similarity.The workflow was tested for its functionality and efficiency by integrating scRNA-seq datasets from heart failure patients with the corresponding mouse model. We were able to assign unique human orthologs to up to 80% of the mouse genes, utilizing the known 17,492 orthologous pairs. Curiously, the integration process enabled the identification of both common and unique regulatory pathways between species in heart failure.In conclusion, our pipeline streamlines the integration process, enhances gene nomenclature alignment and simplifies the translation of mouse models to human disease. We have made the OrthoIntegrate R-package accessible on GitHub (https://github.com/MarianoRuzJurado/OrthoIntegrate), which includes the assignment of ortholog definitions for human and mouse, as well as the pipeline for integrating single cells.KeypointsNovel integration workflow for scRNA-seq data from different species in an easy to use R-package (“OrthoIntegrate”).Improved one-to-one ortholog assignment via sequence similarity scores and string similarity calculations.Validation of “OrthoIntegrate” results with a case study of snRNA-seq from human heart failure with reduced ejection fraction and its corresponding mouse model
  
  Reviewer 2: Yinqi Bai
  
  Comments to Author: Jurado et al. reported a pipeline designed to optimize the detection of orthologous genes and utilized it to enhance the integration of cross-species single-cell RNA sequencing (scRNA-seq) data. They demonstrated the effectiveness of this pipeline by comparing shared and distinct regulatory pathways between human HFrEF (Heart Failure with Reduced Ejection Fraction) patients and the corresponding mouse model. The work provided reliable results that emphasize the importance of exercising caution when using mouse models to study disease mechanisms. However, many important factors should be critically thought about and benchmarked. Here are a few major issues: 1. Ortholog identification has long been a critical and essential step for many comparative, evolutionary, and functional genomic analyses. To evaluate the performance of an orthology inference method, there are some gold standards available for benchmark testing, such as the Quest Orthology Benchmark Service (https://orthology.benchmarkservice.org). Whether OrthoIntegrate outperforms other methods should be comprehensively benchmarked on diverse datasets and metrics, rather than relying solely on the silhouette coefficient score from a heart single-cell RNA sequencing (scRNA-seq) dataset. 2. According to the authors' integration pipeline, both human and mouse scRNA-seq data are individually clustered to assign cell type labels and are then further integrated with orthologous genes for clustering to assign new labels. How do the labels for each cell and each cell type change before and after the integration approach? Does cell type assignment become more reasonable after the integration? The authors should demonstrate that the selection of orthologous genes for clustering improves the accuracy of cell type assignment. The silhouette coefficient score is not a direct metric for assessing accuracy, as it can be influenced by biological factors. For example, in Supplementary Table 3, the silhouette scores of mouse-HFrEF samples generated by Paranoid and OMA are consistently higher than those by OrthoIntegrate, which is opposite to the control groups and human-HFrEF samples. 3. The data analysis needs to be expanded further if there are findings with potential biological significance. For example, the authors mentioned, 'In cluster 25, we observe a group of genes showing increased expression in human FBs, and we also identify a set of genes that are negatively regulated in cluster 28 in human ECs.' However, there is no functional analysis, such as GO or KEGG pathway enrichment analysis, conducted to interpret the data and validate these findings. 4. The discussion section is confusing. The authors should clarify whether the paper is primarily focused on research methods or data analysis. If it is a data analysis paper, the authors should conduct additional investigations to include further data analysis. If it is a research method paper, the authors should extend the discussion to relate to the algorithm itself.
  
  Minor comments: 1. The cell number for each sample and each clustered cell type is critical for assessing the reliability of the results; however, this information is not provided in the paper. 2. As the mouse model is generated through chronic infarction, it raises the question of why very few T/B cell markers are found in immune cells in Figure 1F. Is it possible that these cell types are not adequately captured in the mouse samples? In data integration analysis, the audience may be more interested in understanding how species-specific cell types perform, particularly when, for instance, only macrophages are the dominant immune cells found in human samples. 3. On page 5, clarify "latter ones" in the sentence "Most of the latter ones were long non-coding RNAs with identical gene names." 4. On page 5, correct the reference to Supplementary Figure 4A instead of Supplementary Figure 3A and Supplementary Table 3. 5. On page 16, replace "regulated genes" with "differentially expressed genes (DEGs)" to accurately represent what the authors referred.
  
  Re-review:
  
  The author's additional analysis is commendable. With the inclusion of new evaluation metrics, the benchmark section now appears relatively comprehensive, and the explanations provided for the reduced NMI score are reasonable. In the results section, the supplementary information on functional enrichment further elucidates the biological functions of fibroblast cluster 25 and endothelial cell cluster. 28. There are still some minor suggestions for improvement: 1. The presentation of the biological findings in the discussion section could be more succinct to improve clarity. 2. There is a lack of discussion on the impact of the numerous lncRNAs generated by OrthoIntegrate. This topic requires further exploration and elaboration. 3. Reorganize the paragraphs for "Single cell pre-processing" and "Study samples" to clarify the source of the data used in the article. Emphasize the data generated by authors (E-MTAB-13264) and provide details on the single-cell sequencing process (not only the raw data pre-processing).
2. GigaScience 12 May 2024
  
  in GigaScience
  
  AbstractBiomedical research frequently uses murine models to study disease mechanisms. However, the translation of these findings to human disease remains a significant challenge. In order to improve the comparability of mouse and human data, we present a cross-species integration pipeline for single-cell transcriptomic assays.The pipeline merges expression matrices and assigns clear orthologous relationships. Starting from Ensembl ortholog assignments, we allocated 82% of mouse genes to unique orthologs by using additional publicly available resources such as Uniprot, and NCBI databases. For genes with multiple matches, we employed the Needleman-Wunsch global alignment based on either amino acid or nucleotide sequence to identify the ortholog with the highest degree of similarity.The workflow was tested for its functionality and efficiency by integrating scRNA-seq datasets from heart failure patients with the corresponding mouse model. We were able to assign unique human orthologs to up to 80% of the mouse genes, utilizing the known 17,492 orthologous pairs. Curiously, the integration process enabled the identification of both common and unique regulatory pathways between species in heart failure.In conclusion, our pipeline streamlines the integration process, enhances gene nomenclature alignment and simplifies the translation of mouse models to human disease. We have made the OrthoIntegrate R-package accessible on GitHub (https://github.com/MarianoRuzJurado/OrthoIntegrate), which includes the assignment of ortholog definitions for human and mouse, as well as the pipeline for integrating single cells.
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae011), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1: Ruoyan Li
  
  Comments to Author: In the manuscript entitled 'Improved integration of single cell transcriptome data demonstrates common and unique signatures of heart failure in mice and humans', the authors developed a pipeline (OrthoIntegrate) to assign gene orthologs across species and integrate cross-species single-cell RNA-seq data based on Seurat workflows. The authors further compared OrthoIntegrate to other orthologue databases and tools methods and highlighted a better performance of their method. To illustrate the potential applications of OrthoIntegrate, the authors integrated single-cell/single-nuclei RNA-seq data from cardiac tissue of heart failure patients with reduced ejection fraction (HFrEF) and a mouse model mimicking HFrEF using the pipeline. This revealed commonly regulated genes in the disease condition between species (i.e., genes related to cardiomyocyte energy metabolism) and species-specifically regulated genes (i.e., angiogenesis-related genes in humans). Overall, this is a well-designed study with the development of a useful cross-species single-cell data integration pipeline whose applications have been showcased in the context of heart failure (to me it is more like an improved orthologue assignment method)
  
  A few points need to be addressed before publishing 1. The authors utilized the Needleman-Wunsch algorithm to generate one-to-one orthologs between human genes and mouse genes. What is the advantage of using this algorithm compared to other algorithms i.e., SAMap uses BLAST? 2. The authors have shown the application of OrthoIntegrate in the context of heart failure between mice and humans. Could the authors include at least one more example of using OrthoIntegrate in other disease conditions or between other species to show the versatility of OrthoIntegrate? 3. To assess the quality of clustering after integration, the authors calculated silhouette coefficients/scores and found that integration by OrthoIntegrate resulted in an improved clustering performance. Could the authors include more benchmarking metrics to assess the performance of OrthoIntegrate compared to other methods? The authors could consider metrics like the species mixing score used by BENGAL (Song et al., 2022, biorxiv; https://github.com/Functional-Genomics/BENGAL) 4. Miscalling of figures: silhouette coefficients are shown in Supp_Fig_4 rather than Suppl_Fig_3. 5. Some information on the used datasets in the manuscript has been shown in supplementary table 1, but it's still a bit confusing, for example, where the mouse and human HFrEF datasets come from. I am not exactly sure, but I presume HFrEF datasets are from E-MTAB-13264? This information should be described more explicitly in the method section.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.03.15.532742v1
Apr 2024
www.biorxiv.org www.biorxiv.org

Multi-omic dataset of patient-derived tumor organoids of neuroendocrine neoplasms

3
1. GigaScience 08 Apr 2024
  
  in GigaScience
  
  AbstractBackground Organoids are three-dimensional experimental models that summarize the anatomical and functional structure of an organ. Although a promising experimental model for precision medicine, patient-derived tumor organoids (PDTOs) have currently been developed only for a fraction of tumor types.Results We have generated the first multi-omic dataset (whole-genome sequencing, WGS, and RNA-sequencing, RNA-seq) of PDTOs from the rare and understudied pulmonary neuroendocrine tumors (n = 12; 6 grade 1, 6 grade 2), and provide data from other rare neuroendocrine neoplasms: small intestine (ileal) neuroendocrine tumors (n = 6; 2 grade 1 and 4 grade 2) and large-cell neuroendocrine carcinoma (n = 5; 1 pancreatic and 4 pulmonary). This dataset includes a matched sample from the parental sample (primary tumor or metastasis) for a majority of samples (21/23) and longitudinal sampling of the PDTOs (1 to 2 time-points), for a total of n = 47 RNA-seq and n = 33 WGS. We here provide quality control for each technique, and provide the raw and processed data as well as all scripts for genomic analyses to ensure an optimal re-use of the data. In addition, we report somatic small variant calls and describe how they were generated, in particular how we used WGS somatic calls to train a random-forest classifier to detect variants in tumor-only RNA-seq.Conclusions This dataset will be critical to future studies relying on this PDTO biobank, such as drug screens for novel therapies and experiments investigating the mechanisms of carcinogenesis in these understudied diseases.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see https://doi.org/10.1093/gigascience/giae008 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.
  
  Reviewer Qiuyue Yuan
  
  The authors conducted a study where they generated multi-omics datasets, including whole-genome sequencing and RNA sequencing , for rare neuroendocrine tumors in the lungs, small intestine, and large cells.
  
  They used patient-derived tumor organoids and performed quality control analysis on the datasets. Additionally, they developed a random forest classifier specifically for detecting mutations in the RNA-seq data.
  
  The pipeline used in this study is well-organized, but I have a few queries that I would like to clarify before recommending it for publication.Major concerns:The data processing and quality control procedures would be valuable for other researchers working with similar datasets. It would be beneficial to add these procedures to the GitHub repository (https://github.com/IARCbioinfo/MS_panNEN_organoids).
  
  Furthermore, it would be helpful to provide insights into what constitutes good quality reads, such as the number of unique reads and the ratio of duplicate reads.Regarding the random forest (RF) model, it is mentioned that there are 10 features. Could you clarify if these features are from the public information, or are all the features extracted solely from the RNA-seq data?
  
  Also, does the RF model work for WGS data as well?Was there any specific design implemented to address the issue of imbalanced positive and negative samples?RNA-seq are not used to generate the gene expression here, which would waste important information.Minor concerns:In Figure 6C, what does "Mean minimum depth" refer to?Is the most important feature identified by the RF model a good predictor?
2. GigaScience 08 Apr 2024
  
  in GigaScience
  
  Background Organoids are three-dimensional experimental models that summarize the anatomical and functional structure of an organ. Although a promising experimental model for precision medicine, patient-derived tumor organoids (PDTOs) have currently been developed only for a fraction of tumor types.Results We have generated the first multi-omic dataset (whole-genome sequencing, WGS, and RNA-sequencing, RNA-seq) of PDTOs from the rare and understudied pulmonary neuroendocrine tumors (n = 12; 6 grade 1, 6 grade 2), and provide data from other rare neuroendocrine neoplasms: small intestine (ileal) neuroendocrine tumors (n = 6; 2 grade 1 and 4 grade 2) and large-cell neuroendocrine carcinoma (n = 5; 1 pancreatic and 4 pulmonary). This dataset includes a matched sample from the parental sample (primary tumor or metastasis) for a majority of samples (21/23) and longitudinal sampling of the PDTOs (1 to 2 time-points), for a total of n = 47 RNA-seq and n = 33 WGS. We here provide quality control for each technique, and provide the raw and processed data as well as all scripts for genomic analyses to ensure an optimal re-use of the data. In addition, we report somatic small variant calls and describe how they were generated, in particular how we used WGS somatic calls to train a random-forest classifier to detect variants in tumor-only RNA-seq.Conclusions This dataset will be critical to future studies relying on this PDTO biobank, such as drug screens for novel therapies and experiments investigating the mechanisms of carcinogenesis in these understudied diseases.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see https://doi.org/10.1093/gigascience/giae008 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.
  
  Reviewer Saurabh V Laddha
  
  Alcala et al., did an excellent work on rare cancer type by creating PDTOs molecular fingerprint which has a direct impact for researcher working on these rare cancer type. As a data note, this is excellent resource and covering huge gap in this rare cancer field.These PDTOs holds high impact specially for such cancers which are slow growing and not easy culture in lab. Authors covered details regarding each technique used in this study and figures are clear to understand with exceptional writing.Minor comments:- Did authors compare the PDTOs to tumor molecular dataset ? This will be the key to understand how closely and qualitatively PTDOs are related to actual tumor datasets molecular profile. It is not clear in the current version and it will be helpful to readers to decide whether PTDOs molecular fingerprint system are valuable to them. This is not required for this manuscript to address but a note will be helpful to make valulabe decision to use such resources and with what limitations.- Authors covered longitudinal samples in this system for 1 to 2 timepoints. What changes did they observe (molecularly) looking at this data from a longitudinal timepoints view will be helpful for readers. Also, based on author's experience for longitudinal sampling, do authors have key suggestions for researcher ? a brief discussion will be helpful.- Authors did comprehensive small variant analysis from WGS and RNAseq. Did you authors find known somatic variations for these samples ? mainly comparing against the known published mutational landscape. A note of this will be helpful.- A comment on limitations of PTDOs and molecular fingerprint created from such PDTOs will be valuable.- Authors briefly comment on using such molecular datasets from PDTOs and combining with other datasets to improve on power statistics to discover informative molecular features of these cancers. This points towards my first point on how similar PDTOs are to tumor molecular profile.
3. GigaScience 08 Apr 2024
  
  in GigaScience
  
  Background Organoids are three-dimensional experimental models that summarize the anatomical and functional structure of an organ. Although a promising experimental model for precision medicine, patient-derived tumor organoids (PDTOs) have currently been developed only for a fraction of tumor types.Results We have generated the first multi-omic dataset (whole-genome sequencing, WGS, and RNA-sequencing, RNA-seq) of PDTOs from the rare and understudied pulmonary neuroendocrine tumors (n = 12; 6 grade 1, 6 grade 2), and provide data from other rare neuroendocrine neoplasms: small intestine (ileal) neuroendocrine tumors (n = 6; 2 grade 1 and 4 grade 2) and large-cell neuroendocrine carcinoma (n = 5; 1 pancreatic and 4 pulmonary). This dataset includes a matched sample from the parental sample (primary tumor or metastasis) for a majority of samples (21/23) and longitudinal sampling of the PDTOs (1 to 2 time-points), for a total of n = 47 RNA-seq and n = 33 WGS. We here provide quality control for each technique, and provide the raw and processed data as well as all scripts for genomic analyses to ensure an optimal re-use of the data. In addition, we report somatic small variant calls and describe how they were generated, in particular how we used WGS somatic calls to train a random-forest classifier to detect variants in tumor-only RNA-seq.Conclusions This dataset will be critical to future studies relying on this PDTO biobank, such as drug screens for novel therapies and experiments investigating the mechanisms of carcinogenesis in these understudied diseases.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see https://doi.org/10.1093/gigascience/giae008 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.
  
  Reviewer Masashi Fujita
  
  In this manuscript, Alcala et al. have reported on the whole genome sequencing (WGS) and RNA sequencing (RNA-seq) of 23 patient-derived tumor organoids of neuroendocrine neoplasms.
  
  This is a detailed report on the quality control of WGS, RNA-seq, and sample swap. The methods are solid and well-described. The raw sequencing data have been deposited in a public repository. This dataset could be a valuable resource for exploring the biology and treatment of this rare type of tumor.
  
  Here are my comments to the authors:
  
  Could you please clarify whether the organoids described in this manuscript will be distributed? If so, could you provide the contact address and any restrictions, such as a material transfer agreement?You have deposited the RNA-seq gene expression matrix in the public repository European Genome-phenome Archive (dataset ID: EGAD00001009994).
  
  However, the file is under controlled access. This limits the availability of data, especially for scientists who just want a quick glance at the data. Since the gene expression matrix does not contain personally identifiable information, I wonder if you could make the file open access.
  
  You have reported how you detected somatic mutations in the organoids. However, you did not share the list of detected mutations. Sharing this list would help scientists who do not have a computational background. Open access is preferable in this case, but controlled access is also acceptable because germline variants could be misclassified as somatic.
  
  The primary site of mLCNEC23 is unknown. Could you infer its primary site based on gene expression patterns or driver mutations?I have concerns about the generalizability of your random forest model because it was trained using only 22 somatic mutations. Could you assess your prediction model using publicly available datasets of cancer genomes (e.g., TCGA)?
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.08.31.555732v1
Mar 2024
www.biorxiv.org www.biorxiv.org

Molecular Property Diagnostic Suite for COVID-19 (MPDSCOVID-19): An open access disease specific drug discovery portal

2
1. GigaScience 18 Mar 2024
  
  in GigaByte
  
  Editors Assessment:
  
  MPDSCOVID-19 has been developed as a one-stop solution for drug discovery research for COVID-19, running on the Molecular Property Diagnostic Suite (MPDS) platform. This is built upon the open-source Galaxy workflow system, integrating many modules and data specific to COVID-19. Data integrated includes SARS-CoV-2 targets, genes and their pathway information; information on repurposed drugs against various targets of SARS-CoV-2, mutational variants, polypharmacology for COVID-19, drug-drug interaction information, Protein-Protein Interaction (PPI), host protein information, epidemiology, and inhibitors databases. After improvements to the technical description of the platform, testing helped demonstrate the potential to drive open-source computational drug discovery with the platform.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 18 Mar 2024
  
  in GigaByte
  
  AbstractComputational drug discovery is intrinsically interdisciplinary and has to deal with the multifarious factors which are often dependent on the type of disease. Molecular Property Diagnostic Suite (MPDS) is a Galaxy based web portal which was conceived and developed as a disease specific web portal, originally developed for tuberculosis (MPDSTB). As specific computational tools are often required for a given disease, developing a disease specific web portal is highly desirable. This paper emphasises on the development of the customised web portal for COVID-19 infection and is referred to as MPDSCOVID-19. Expectedly, the MPDS suites of programs have modules which are essentially independent of a given disease, whereas some modules are specific to a particular disease. In the MPDSCOVID-19 portal, there are modules which are specific to COVID-19, and these are clubbed in SARS-COV-2 disease library. Further, the new additions and/or significant improvements were made to the disease independent modules, besides the addition of tools from galaxy toolshed. This manuscript provides a latest update on the disease independent modules of MPDS after almost 6 years, as well as provide the contemporary information and tool-shed necessary to engage in the drug discovery research of COVID-19. The disease independent modules include file format converter and descriptor calculation under the data processing module; QSAR, pharmacophore, scaffold analysis, active site analysis, docking, screening, drug repurposing tool, virtual screening, visualisation, sequence alignment, phylogenetic analysis under the data analysis module; and various machine learning packages, algorithms and in-house developed machine learning antiviral prediction model are available. The MPDS suite of programs are expected to bring a paradigm shift in computational drug discovery, especially in the academic community, guided through a transparent and open innovation approach. The MPDSCOVID-19 can be accessed at http://mpds.neist.res.in:8085.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.114), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1. Prashanth N Suravajhala
  
  Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes. The authors could describe Minimum Information about bioinformatics investigation (MIABI) guidelines. Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code? github and Zenodo, yes!
  
  I tested git, forked it and as I didn't test the graphical version, ensured all python libraries are working! Is the documentation provided clear and user friendly? Yes. Yes, a white paper could be more friendly! Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level? Yes. yes with README version! Have any claims of performance been sufficiently tested and compared to other commonly-used packages? Yes, as described by the authors Are there (ideally real world) examples demonstrating use of the software? Yes. The Molecular Property Dynamic Suite (MPDS) is a welcome initiative which would serve chemical space for research community. While the authors aimed to deploy it in Galaxy, there is no Galaxy reference cited in first few introductory lines. A strong rationale on Galaxy-MPDS connect could be a value addition The port 8085/8080 are ephemeral and it would be nice if the authors deploy it on a more permanent base An absolute strength for the suite is availability of source code so that end-users can fine tune and reinstantiate the codes. In the architecture, could the end user have a chance to deploy biopython modules for drug discovery/modeling
  
  In Page 4, the authors can define what are the tools precisely used in MPDS 2.3 section: The PPI is not abbreviated as first use The results are exploited well for disease dependent/independent use. However, the major challenge for ligand use/preparation is the use of ncRNAs. Could MPDS provide such instances where ncRNAs could be used fpr targeted ligands? L28 in section 4.1: Pluralis for features ( as one of is used) Also a word or two on aadhar card for perhaps naive users may be mentioned and it may be italicized as it may be a domestic word. Does MPDS suite augur well for Anvaya that Government of India launched, or Tavexa or Taverna? A word to two on local setting up of cloud instance may be a nice addition
  
  Scores on a scale of 0-5 with 5 being the best
  
  Language: 4 Novelty: 4.5 Brevity: 4 Scope and relevance: 4 Language/Brevity checks: Page 9 L6: fulfill misspelt webserver are two words, IMHO
  
  Page 10: CADD which IS available
  
  Tabl S2/S4: from THE coronavirdiae space between anticoronavirusdrugs
  
  Figure S3: remove OF (identifying OF existing) Supporting information may be corrected High resolution Figures esp GA, Figures 2-4 may be inserted
  
  Reviewer 2. Abdul Majeed
  
  Is the language of sufficient quality? Yes. Some changes are needed to make the writing more scientific. Is the code executable? Unable to test Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Unable to test
  
  Additional comments: In this paper, the authors introduced a Molecular Property Diagnostic Suite (MPDS), which is a Galaxy-based web portal that was conceived and developed as an open-source disease-specific web portal. MPDS is a customized web portal developed for COVID-19, which is a one-stop solution for drug discovery research. I read the article; it is well-written and well-presented. The enclosed contents can be very useful for researchers working in this field (e.g., COVID-19 systems development). However, I propose some comments/concerns to the current version that need correction during the revision. 1- In the abstract, please provide the technical description of the method’s working. Also, please mention the entities which can benefit from the system. 2- The introduction section doesn’t present the challenges/problems of the existing tools. Please discuss the challenges of the previous such tools and how are they addressed through this new system. 3- I could not find the concrete details of data modalities supported in the system. The authors are advised to include such details. 4- The authors mentioned the use of ML, but I couldn’t find any potential usage of ML models. Please add such analysis during the revision. 5- Also, please add some performance results like time complexity, storage, I/O cost, etc. 6- One comprehensive diagram should be included to better illustrate the working of the proposed tool. 7- Please add limitations of the proposed tool in the revised work. 8- Please add the potential implications of this tool in the context of current/future pandemics.
  
  Re-review: I have carefully checked the revised work and the author's responses. The authors have made the desired modifications. I have no major concerns on this paper. In the previous review round, Comment #: 3 has not been properly responded by the authors. By data modality, I meant tabular data, graph data, audio data, video data, etc. Authors should add this aspect clearly in the paper about each data modality processed in their system. In Figure 4, some contents (e.g., protein information, PPI interaction, etc.) are unreadable. The abbreviations are not consistently written in terms of small and capital letters. In the paper, the authors are advised to clearly describe the purpose of this tool, who will benefit and in what capacity, why these kinds of tools are needed, etc. I suggest adding such information in abstract to clearly convey the message to readers. In the title, please recheck one word, Open Access or Open Source. The journals are open access while the software are usually open source .
  
  Reviewer 3. Agastya P Bhati
  
  Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes. As noted in my comments, it would be beneficial to clarify what new capabilities are provided by this new portal over and above what is already available currently. Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code? No. There is a github repository (https://github.com/gnsastry/MPDS-18Compound_Library), however, I am unable to access it currently. As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code? Yes. A github repository provides such capabilities. However, it is inaccessible currently. Is the code executable? Unable to test Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Unable to test Have any claims of performance been sufficiently tested and compared to other commonly-used packages? Is automated testing used or are there manual steps described so that the functionality of the software can be verified? No Additional comments: Molecular Property Diagnostic Suite for COVID-19 (MPDSCOVID19) is an open-source disease specific web portal aiming to provide a collection of all tools and databases relevant for COVID-19 that are available online along with a few in-house scripts at a single portal. It is built upon another platform called "Galaxy" that provides similar services for data intensive biomedical research. MPDSCOVID19 is in continuation to two other similar disease-specific portals that this group has published earlier - for Tuberculosis and Diabetes mellitus. Overall, MPDSCOVID19 is an interesting and useful resource that could be helpful for biomedical community in conducting COVID-19 related research. It brings together all the databases and relevant tools that may make a researcher's life easier as exemplified through the various case studies included.
  
  I recommend publishing this article after the following revisions noted. Please note that any mention of page numbers below is referring to the reviewer PDF version.
  
  Major revisions:
  
  (1) One main issue in this manuscript is the lack of a clear description of the "new" capabilities provided by MPDSCOVID19 over and above what Galaxy provides. I think a clear distinction between the capabilities/features of Galaxy and MPDSCOVID19 would help improve the manuscript substantially and help readers better understand the capabilities of this new COVID-19 portal.
  
  Further, a description of the additions in the new portal over the earlier TB and Diabetes portals is mentioned on page 7. However, I think more details on such advancements/additions would be beneficial. It could be in the form of a table.
  
  (2) It is mentioned that a major advancement in this new portal is the inclusion of ML/AI models/approaches, however no details have been provided. It would be beneficial to briefly describe what ML based capabilities are included in MPDS and how they can be used by any general user. An additional case study demonstrating the same would be helpful.
  
  (3) MPDS portal provides a collection of tools and databases for COVID-19. However, such resources are ever-growing and hence constant updating of the portal's capabilities/resources would be a necessary requirement for its sustainability. There is no mention of any such plans. Do authors have a modus operandi for the same? Have there been further releases of the previous MPDS portals for TB and Diabetes that may be relevant here?
  
  (4) Page 6 - lines 3-4: I suggest replacing "are going to witness" with "are witnessing". There are several recent advancements in applying ML/AI based approaches to improve different aspects of drug discovery. I recommend including a few references here to this effect. Below are some relevant examples:
  
  (a) 10.1021/acs.jcim.0c00915 (b) 10.1021/acs.jcim.1c00851 (c) 10.1038/s41598-023-28785-9 (d) 10.1098/rsfs.2021.0018 (e) 10.1145/3472456.3473524 (f) 10.1145/3468267.3470573
  
  (5) Page 7 - line 8: I am assuming that the terms like "updates", "additions", etc., used in this paragraph are comparing MPDS with its older versions. If so, it would be beneficial to clarify this explicitly. In addition, I suggest that the authors include a brief literature survey to describe what other tools and/or webservers are available already and how MPDS compares with them. This has not been done so far.
  
  (6) The github repository is currently inaccessible publicly. This needs rectification.
  
  Minor revisions:
  
  (1) Page 4: Before introducing MPDSCOVID19 it makes sense to briefly describe Galaxy and its main features. For instance moving forward lines 19-20 (page 4) and lines 3-6 (page 5) to line 12 (page 4).
  
  (2) Page 5 - line 22: I suggest that authors mention the total number of databases/servers that are covered by MPDSCOVID19 as of now. From Table S1, it appears that there are 15 currently (items 5 and 7 are repeated so the 13 seems the wrong total - needs rectification).
  
  (3) Page 5 - line 30: It would make sense to specify details of the MPDS local server. For instance, how many cores/GPUs are available and what are their hardware architectures? Also, it would be beneficial for the users to know if it is possible to use MPDS tools on their own or public infrastructures for large scale implementations. I suggest authors comment on this aspect too.
  
  (4) Page 6 - lines 16-19: The sentence "Galaxy platform.......extend the availability." needs some rephrasing. It is too long and the hard to comprehend.
  
  (5) Page 7 - line 18: I don't understand the word "colloids". Please clarify.
  
  (6) Page 8 - line 30: "section 2.3" is referred to but I don't see any section numbering the PDF provided. This needs rectification.
  
  Re-review: I am satisfied with the changes made to the manuscript and recommend publishing it in its current form if all other reviewers are happy with that.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.08.29.555437v1
www.biorxiv.org www.biorxiv.org

An improved chromosome-level genome assembly of perennial ryegrass (Lolium perenne L.)

2
1. GigaScience 18 Mar 2024
  
  in GigaByte
  
  Editors Assessment:
  
  This Data Release paper presents an updated genome assembly of the doubled haploid perennial ryegrass (Lolium perenne L.) genotype Kyuss (Kyuss v2.0). To correct for structural the authors de novo assembled the genome again with ONT long-reads and generated 50-fold coverage high-throughput chromosome conformation capture (Hi-C) data to assist pseudo-chromosome construction. After being asked for some more improvements to gene and repeat annotation the authors now demonstrate the new assembly is more contiguous, more complete, and more accurate than Kyuss v1.0 and shows the correct pseudo-chromosome structure. This more accurate data have great potential for downstream genomic applications, such as read mapping, variant calling, genome-wide association studies, comparative genomics, and evolutionary biology. These future analyses being able to benefit forage and turf grass research and breeding.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 18 Mar 2024
  
  in GigaByte
  
  ABSTRACTThis work is an update and extension of the previously published article “Ultralong Oxford Nanopore Reads Enable the Development of a Reference-Grade Perennial Ryegrass Genome Assembly”, by Frei et al.. The published genome assembly of the doubled haploid perennial ryegrass (Lolium perenne L.) genotype Kyuss marked a milestone for forage grass research and breeding. However, order and orientation errors may exist in the pseudo-chromosomes of Kyuss, since barley (Hordeum vulgare L.), which diverged 30 million years ago from perennial ryegrass, was used as the reference to scaffold Kyuss. To correct for structural errors possibly present in the published Kyuss assembly, we de novo assembled the genome again and generated 50-fold coverage high-throughput chromosome conformation capture (Hi-C) data to assist pseudo-chromosome construction. The resulting new chromosome-level assembly showed improved quality with high contiguity (contig N50 = 120 Mb), high completeness (total BUSCO score = 99%), high base-level accuracy (QV = 50) and correct pseudo-chromosome structure (validated by Hi-C contact map). This new assembly will serve as a better reference genome for Lolium spp. and greatly benefit the forage and turf grass research community.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.112), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1. Qing Liu
  
  This updated double haploid perennial ryegrass (Lolium perenne L.) showed contig N50 of 120 Mb, total BUSCO score=99%, which verified that the improved assembly can serve a reference for Lolium species using 50-fold coverage Hi-C data. The article is well edited except for below revision points. The minor revision is suggested for the current version. 1 Please elucidate the Kyuss v2.0, whether its reference is the same as Kyuss v1.0, if same or separate reference please elucidate. 2 In Table 3 of page 6, What the repeat element number for each family, could authors listed in number and proportion in order to clear the family category, for example, is the number of rolling-circles the same for Heltrons? 3 Tandem repeat or satellite or centromere location data, could author provide for the updated assembly of the Lolium species. 4 For Figure 1, what the heterozygosity and k-mer estimated genome size, I can’t find the data. 5 In Figure 3A, lowercase letter a, b, c , d and e are suggested to subsittute the A, B, C, D and E in order to avoid Figure 3A and Figure 3AA
  
  Reviewer 2. Istvan Nagy
  
  Are all data available and do they match the descriptions in the paper? No. Minor revision in the manuscript body is suggested. Gene annotation and repeat annotation data need some minor revision) See details in the "Additional Comments" section. Additional Comments: The submitted dataset reports and improved chromosome-level assembly and annotation of the doubled-haploid line Kyuss of Lolium perenne. The present v2.0 assembly is showing significant improvements as compared to the Kyuss v1.0 assembly published by the same group in 2021: The new assembly incorporates 99% of the estimated genome size in seven pseudo-chromosomes and the >99% BUSCO completeness of the gene space is also impressive.
  
  Below are mine remarks and suggestions to the present version of manuscript:
  
  Genome assembly and polishing It's indicated that for the primary assembly of the present work the same source of ONT reads were used as for the previous Kyuss v1.0 assembly. However, in the present manuscript the authors report clearly better assembly quality as opposed to the Kyuss v1.0 assembly. The question remains open, whether the authors achieved better results by changing/optimizing the primary assembly parameters, and/or applying a step-wise, iterating strategy with repeated rounds of long-read and short-read corrections? By any means, a more detailed description/specification of assembly parameters would be desirable.
  
  Genome annotation In the provided annotation file "kyuss_v2.gff" in the majority of cases gene IDs consisting of the reference chromosome ID and of an ongoing number, like "KYUSg_chr1.188" are used. However in a few cases gene IDs like "KYUSt_contig_1275.207" are also used. This inconsistency might create confusions for future users of Kyuss_2 resources, and while the later type of gene IDs might be useful for internal usage, they became meaningless, as instead of contigs now pseudo-chromosomes (and some unplaced scaffolds) are used as references. The authors should modified the gff files and use a consistent naming scheme for all genes. Further, transcript DNA sequences as well as transcript protein sequences with consistent naming schemes should also be provided.
  
  Repeat annotation The authors should modify Table 3 by specifying and breaking down repeat categories according to the Unified Classification System of transposable elements, by giving Order and Superfamily specifications (like LTR/Gipsy and LTR/Copia etc, in accord with the provided gff file "kyuss_v2_repeatmask.gff").
  
  According to the provided repeat annotation BED file, more than 750K repeat features have been annotated on the Kyuss_2 genome. Of these repeat features 57815 are overlapping with gene features and 25843 of these overlaps are longer than 100 bp. This indicate that a substantial portion of the 38765 annotated genes might represent sequences coding for transposon proteins and/or transposon related ORFs. I suggest that the authors revise the gene annotation data (and at least remove gene annotation entries that show ~100% overlap with repeat features).
  
  Assembly quality assessment "The quality score(QV) estimated by Polca for Kyuss v2.0 was 50, suggesting a 99.999% base-level accuracy with the probability of one sequencing error per 100 kb. The estimated accuracy of Kyuss v1.0 is 99.990% (QV40, Table 1), which is 10 times lower than Kyuss v2.0, suggesting that Kyuss v2 is more accurate than Kyuss v1.0." In my opinion, this sentence needs clarification as readers might have difficulties to properly interpret this - especially considering the facts that the same long-read data was used for both for the v1 as well a for the v2 assembly versions, the short-read mapping rate was the same (99.55%) for both versions and the K-mer completeness analysis results differed only slightly (99,39% vs. 99.48%).
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.12.05.570088v1
academic.oup.com academic.oup.com

Leveraging citizen science for monitoring urban forageable plants

1
1. GigaScience 18 Mar 2024
  
  in Public
  
  We believe citizen science has the potential to promote human and nature connection in urban areas and provide useful data on urban biodiversity.
  
  See more on this in GigaBlog http://gigasciencejournal.com/blog/citizen-scientists-can-see-the-wood-for-the-trees/
Visit annotations in context

Annotators

GigaScience

URL

academic.oup.com/gigascience/article/doi/10.1093/gigascience/giae007/7619370
www.biorxiv.org www.biorxiv.org

Citizen Science Data on Urban Forageable Plants: A Case Study in Brazil

2
1. GigaScience 05 Mar 2024
  
  in GigaByte
  
  Editors Assessment:
  
  This is a Data Release paper describing data sets derived from the Pomar Urbano project cataloging edible fruit-bearing plants in Brazil. Including data sourced from the citizen science iNaturalist app, tracking the distribution and monitoring of these plants within urban landscapes (Brazilian state capitals). The data was audited and peer reviewed and put into better context, and there is a companion commentary in GigaScience journal better explaining the rationale for the study. Demonstrating this data providing a platform for understanding the diversity of fruit-bearing plants in select Brazilian cities and contributing to many open research questions in the existing literature on urban foraging and ecosystem services in urban environments.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 05 Mar 2024
  
  in GigaByte
  
  AbstractThis paper presents two key data sets derived from the Pomar Urbano project. The first data set is a comprehensive catalog of edible fruit-bearing plant species, native or introduced in Brazil. The second data set, sourced from the iNaturalist platform, tracks the distribution and monitoring of these plants within urban landscapes across Brazil. The study encompasses data from all 27 Brazilian state capitals, focusing on the ten cities that contributed the most observations as of August 2023. The research emphasizes the significance of citizen science in urban biodiversity monitoring and its potential to contribute to various fields, including food and nutrition, creative industry, study of plant phenology, and machine learning applications. We expect the data sets to serve as a resource for further studies in urban foraging, food security, cultural ecosystem services, and environmental sustainability.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.108 and see also the accompanying commentary in GigaScience: https://doi.org/10.1093/gigascience/giae007 ), and has published the reviews under the same license as follows:
  
  Reviewer 1. Corey T. Callaghan
  
  Are the data and metadata consistent with relevant minimum information or reporting standards?
  
  Yes. More information should be given on the relevance to GBIF. And why the dataset is necessary to 'stand alone'. The main reason I guess is because in this context cultivated organisms are really valuable as a lot of your target organisms will indeed be cultivated.
  
  Is the data acquisition clear, complete and methodologically sound?
  
  No. More detail should be provided about the difference in research grade and cultivated organisms on iNaturalist. The RG could be downloaded from GBIF, but I understand the need to go around that given that the cultivated organisms are also valuable in this context.
  
  Is the validation suitable for this type of data?
  
  No. There should be more information provided on the CV model. And more information provided on the importance of identifiers in iNaturalist ecosystem. They are critically important. Right now, it reads as if the CV model generally accurately identifies organisms, but this isn't necessarily true, and there is no reference given. However, the identifiers are necessary to help data processing and identification of the organisms submitted to iNaturalist. I also think the biases of cultivated organisms not being identified as readily by iNaturalist identifiers should be discussed somewhere in the manuscript.
  
  Additional Comments:
  
  I appreciated the description of this dataset and particularly liked the 'context' section and think it did a good job of setting up the need for such data. I would use iNaturalist throughout as opposed to iNat since iNat is a bit more colloquial.
  
  Reviewer 2. Patrick Hurley
  
  This is a very interesting paper and approach to examining questions related to the presence of edible plants in Brazilian cities. As such, it addresses--whether intentionally or not--open questions within the existing literatures of urban foraging and urban ecosystem services (Shackleton et al. 2017, ), among others, including:
  
  how the existing species composition of cities create already existing edible/useful landscapes (see Hurley et al. 2015, Hurley and Emery 2018, Hurley et al. 2022), or what the authors appear to describe as "orchards", and including the use of open data sources to support these activities (Stark et al. 2019),
  
  the ways that urban forests support cultural ecosystem services (Plieininger et al. 2015), 2a. dietary need/food security (Synk et al. 2017, Bunge et al. 2019, Gaither et al. 2020, Sardeshpande & Shackleton 2023), including in Brazil (Brito et al 2020), and diversity (Gareake & Shackleton 2020), 2b. sharing of ecological knowledge (Landor-Yamagata 2018), and 2c. social-ecological resilience (Sardeshpande et al. 2021) as well as 2d. reconnect urban residents to nature/biodiversity (Palliwoda et al. 2017, Fisher and Kowarik 2020, Schunko and Brandner 2022).
  
  I note that while most of the literatures above focus on foods and edibility, Hurley et al. 2015 and Hurley and Emery consider the relationship of urban forests for other, not food-related uses and thus the material connections and uses by people within art and other cultural objects.
  
  I also note that some scholars are beginning to focus on the question of urban governance and the inclusion of urban fruit trees (Kowalski & Conway 2023), building off of the rapidly expanding literature on urban food forestry (Clark and Nicholas 2011) and edible green infrastructure. The difference between these literatures and those I've suggested above is that they generally focus on policy and planting interventions to insert, add, or otherwise enhance the edibility of these spaces (as opposed to the above stream analyzing how people interact with what is already there, whether those species are intended for harvest by people, or not, and thus it seems like this piece better links to those issues .
  
  It would be helpful to see at least some of these links between the present research and its focus on methods for using a particularly valuable dataset linked to/with efforts to address the conceptual questions that are raised by the authors. For example, in relation to item #1 above, I might suggest dropping the use of "orchard" and describe the species being analyzed as representative of an "actually existing food forests" within these cities (building on the existing literature Items 1 through 3), while indicating the insights it might provide to those interested in interventions to shape future cities and their species composition to enhance human benefits (items 4 and 5). Likewise, it would be helpful to reference the items in 2a through 2d where they appear in the Context section, building on the very high level citations already (e.g., current citations #5 FAO and #6 Salbitano).
  
  To be clear, much of what I'm asking for here can be, I think, addressed through additions of single sentences or phrases throughout the context section, along with brief reference to these within the brief discussions under "Reuse Potenial".
  
  Or perhaps this is too in-depth for this journal. If that's the case, then I do think that reference to several key articles is needed, specifically to signal the insights this piece has for this ongoing work to understand how urban forests function for human benefit. Those would be:
  
  Shackleton et al. 2017, Hurley & Emery 2018, Garekea & Shackleton 2020, Fisher & Kowarik 2020, Sardeshpande et al. 2021.
  
  Most critically, the work of Stark et al. 2019 should be acknowledged.
  
  My sincere thanks to the authors to learn from this work and my apologies for the delay in completing this review.
  
  Works Cited Above
  
  Bunge, A., Diemont, S. A., Bunge, J. A., & Harris, S. (2019). Urban foraging for food security and sovereignty: quantifying edible forest yield in Syracuse, New York using four common fruit-and nut-producing street tree species. Journal of Urban Ecology, 5(1), juy028.
  
  Fischer, L. K., & Kowarik, I. (2020). Connecting people to biodiversity in cities of tomorrow: Is urban foraging a powerful tool?. Ecological Indicators, 112, 106087.
  
  Garekae, H., & Shackleton, C. M. (2020). Foraging wild food in urban spaces: the contribution of wild foods to urban dietary diversity in South Africa. Sustainability, 12(2), 678.
  
  Hurley, P. T., Emery, M. R., McLain, R., Poe, M., Grabbatin, B., & Goetcheus, C. L. (2015). Whose urban forest? The political ecology of foraging urban nontimber forest products. Sustainability in the global city: Myth and practice, 187-212.
  
  Hurley, P. T., & Emery, M. R. (2018). Locating provisioning ecosystem services in urban forests: Forageable woody species in New York City, USA. Landscape and Urban Planning, 170, 266-275.
  
  Hurley, P. T., Becker, S., Emery, M. R., & Detweiler, J. (2022). Estimating the alignment of tree species composition with foraging practice in Philadelphia's urban forest: Toward a rapid assessment of provisioning services. Urban Forestry & Urban Greening, 68, 127456.
  
  Kowalski, J. M., & Conway, T. M. (2023). The routes to fruit: Governance of urban food trees in Canada. Urban Forestry & Urban Greening, 86, 128045.
  
  Landor-Yamagata, J. L., Kowarik, I., & Fischer, L. K. (2018). Urban foraging in Berlin: People, plants and practices within the metropolitan green infrastructure. Sustainability, 10(6), 1873.
  
  Palliwoda, J., Kowarik, I., & von der Lippe, M. (2017). Human-biodiversity interactions in urban parks: The species level matters. Landscape and Urban Planning, 157, 394-406.
  
  Plieninger, T., Bieling, C., Fagerholm, N., Byg, A., Hartel, T., Hurley, P., ... & Huntsinger, L. (2015). The role of cultural ecosystem services in landscape management and planning. Current Opinion in Environmental Sustainability, 14, 28-33.
  
  Sardeshpande, M., Hurley, P. T., Mollee, E., Garekae, H., Dahlberg, A. C., Emery, M. R., & Shackleton, C. (2021). How people foraging in urban greenspace can mobilize social–ecological resilience during Covid-19 and beyond. Frontiers in Sustainable Cities, 3, 686254.
  
  Sardeshpande, M., & Shackleton, C. (2023). Fruits of the city: The nature, nurture and future of urban foraging. People and Nature, 5(1), 213-227.
  
  Schunko, C., & Brandner, A. (2022). Urban nature at the fingertips: Investigating wild food foraging to enable nature interactions of urban dwellers. Ambio, 51(5), 1168-1178.
  
  Shackleton, C. M., Hurley, P. T., Dahlberg, A. C., Emery, M. R., & Nagendra, H. (2017). Urban foraging: A ubiquitous human practice overlooked by urban planners, policy, and research. Sustainability, 9(10), 1884.
  
  Stark, P. B., Miller, D., Carlson, T. J., & De Vasquez, K. R. (2019). Open-source food: Nutrition, toxicology, and availability of wild edible greens in the East Bay. PLoS One, 14(1), e0202450.
  
  Synk, C. M., Kim, B. F., Davis, C. A., Harding, J., Rogers, V., Hurley, P. T., ... & Nachman, K. E. (2017). Gathering Baltimore’s bounty: Characterizing behaviors, motivations, and barriers of foragers in an urban ecosystem. Urban Forestry & Urban Greening, 28, 97-102.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.01.22.575882v1
www.biorxiv.org www.biorxiv.org

The probability of edge existence due to node degree: a baseline for network-based predictions

2
1. GigaScience 04 Mar 2024
  
  in GigaScience
  
  AbstractImportant tasks in biomedical discovery such as predicting gene functions, gene-disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction, then imbalance or bias in the distribution of degrees could lead to nonspecific or misleading predictions. We introduce a network permutation framework to quantify the effects of node degree on edge prediction. Our framework decomposes performance into the proportions attributable to degree and the network’s specific connections. We discover that performance attributable to factors other than degree is often only a small portion of overall performance. Degree’s predictive performance diminishes when the networks used for training and testing—despite measuring the same biological relationships—were generated using distinct techniques and hence have large differences in degree distribution. We introduce the permutation-derived edge prior as the probability that an edge exists based only on degree. The edge prior shows excellent discrimination and calibration for 20 biomedical networks (16 bipartite, 3 undirected, 1 directed), with AUROCs frequently exceeding 0.85. Researchers seeking to predict new or missing edges in biological networks should use the edge prior as a baseline to identify the fraction of performance that is nonspecific because of degree. We released our methods as an open-source Python package (https://github.com/hetio/xswap/).
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giae001), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Linlin Zhuo
  
  In this manuscript, the authors introduce a network permutation framework to quantify the effects of node degree on edge prediction. The importance of degree in the edge detection task is self-evident, and the quantification of this effect is undoubtedly groundbreaking. The experimental results on a variety of datasets demonstrate the advanced nature of the method proposed by the authors. However, some parts require further explanation from the authors and can be considered for acceptance in a later stage.
  
  1.The imbalance of the degree distribution has a significant impact on the results of the edge detection task. In this manuscript, the author proposes a framework to quantify this impact. It is important to note that the manuscript does not explicitly mention the specific form in which the quantification is reflected, such as whether it is presented as an indicator or in another form. Therefore, further explanation from the author is needed to clarify this aspect.
  
  2.The authors propose that researchers employ marginal priors as a reference point to discern the contributions attributed to node degree from those arising from specific performance. It would be helpful if the authors could elaborate further on the methodology or provide a sample demonstration to clarify the implementation of this approach.
  
  3.For the XSwap algorithm, I wonder that if the authors could provide a more detailed explanation of its workings, including a step-by-step implementation of the improved XSwap. Furthermore, it would be beneficial if the authors could highlight the significance of the improved XSwap algorithm in biomedical tasks.
  
  4.The author presents the pseudocode of the XSwap algorithm in Figure 2, along with the improved pseudocode after the author's enhancements. Both pseudocodes are accompanied by explanatory text. However, I believe that expressing them in the form of a figure would make it more visually appealing and intuitive.
  
  5.The authors introduce the edge prior to quantify the probability of two nodes being connected solely based on their degree. I request the authors to provide a detailed explanation of the specific implementation of the edge prior.
  
  6.In the "Prediction tasks" section, the author utilizes three prediction tasks to assess the performance of the edge prior. It is recommended to segment correctly for better display of the content.
  
  7.The focus of the article might not be prominent enough. It is advisable for the author to provide further elaboration on the advanced nature of the proposed framework and its significance in practical tasks. This would help emphasize the main contributions of the research and its relevance in real-world applications.
2. GigaScience 04 Mar 2024
  
  in GigaScience
  
  AbstractImportant tasks in biomedical discovery such as predicting gene functions, gene-disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction, then imbalance or bias in the distribution of degrees could lead to nonspecific or misleading predictions. We introduce a network permutation framework to quantify the effects of node degree on edge prediction. Our framework decomposes performance into the proportions attributable to degree and the network’s specific connections. We discover that performance attributable to factors other than degree is often only a small portion of overall performance. Degree’s predictive performance diminishes when the networks used for training and testing—despite measuring the same biological relationships—were generated using distinct techniques and hence have large differences in degree distribution. We introduce the permutation-derived edge prior as the probability that an edge exists based only on degree. The edge prior shows excellent discrimination and calibration for 20 biomedical networks (16 bipartite, 3 undirected, 1 directed), with AUROCs frequently exceeding 0.85. Researchers seeking to predict new or missing edges in biological networks should use the edge prior as a baseline to identify the fraction of performance that is nonspecific because of degree. We released our methods as an open-source Python package (https://github.com/hetio/xswap/).
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giae001), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Babita Pandey
  
  The manuscript "The probability of edge existence due to node degree: a baseline for network-based predictions" presents novel work. But some of the sections are written very briefly, so it is difficult to understand. The section that needs revision are: Degree-grouping, The edge prior encapsulates degree, Degree can underly a large fraction of performance and Analytical approximation of the edge prior. The result section needs revision.
  
  Some other concerns are: Academic adhar, Jaccard coefficient, preferential atachment etc are link prediction methods. Why auther has termed them as edge prediction features.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.01.05.522939v1
Feb 2024
www.biorxiv.org www.biorxiv.org

SAW: An efficient and accurate data analysis workflow for Stereo-seq spatial transcriptomics

2
1. GigaScience 26 Feb 2024
  
  in GigaByte
  
  Editors Assessment:
  
  One limiting factor in the adoption of spatial omics research are workflow systems for data preprocessing, and to address these authors developed the SAW tool to process Stereo-seq data. The analysis steps of spatial transcriptomics involve obtaining gene expression information from space and cells. Existing tools face issues with large data sets, such as intensive spatial localization, RNA alignment, and excessive memory usage. These issues affect the process's applicability and efficiency. To address this, this paper presents a high-performance open-source workflow called SAW for Stereo-Seq. This includes mRNA position reconstruction, genome alignment, matrix generation, clustering, and result file generation for personalized analysis. During review the authors have added examples of MID correction in the article to make the process easier to understand. And In the future, more accurate algorithms or deep learning models may further improve the accuracy of this pipeline.
  
  *This evaluation refers to version 1 of the preprint *
  
  Summary
2. GigaScience 26 Feb 2024
  
  in GigaByte
  
  AbstractThe basic analysis steps of spatial transcriptomics involve obtaining gene expression information from both space and cells. This process requires a set of tools to be completed, and existing tools face performance issues when dealing with large data sets. These issues include computationally intensive spatial localization, RNA genome alignment, and excessive memory usage in large chip scenarios. These problems affect the applicability and efficiency of the process. To address these issues, a high-performance and accurate spatial transcriptomics data analysis workflow called Stereo-Seq Analysis Workflow (SAW) has been developed for the Stereo-Seq technology developed by BGI. This workflow includes mRNA spatial position reconstruction, genome alignment, gene expression matrix generation and clustering, and generate results files in a universal format for subsequent personalized analysis. The excutation time for the entire analysis process is ∼148 minutes on 1G reads 1*1 cm chip test data, 1.8 times faster than unoptimized workflow.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.111) as part of our Spatial Omics Methods and Applications series (https://doi.org/10.46471/GIGABYTE_SERIES_0005), and has published the reviews under the same license as follows:
  
  Reviewer 1. Zexuan Zhu
  
  It would be helpful if some examples can be provided to illustrate the key steps, e.g., the gene region annotation process and MID correction. Some information of the references is missing. Please carefully check the format of the references.
  
  Decision: Minor Revision
  
  Reviewer 2. Yanjie Wei
  
  In this manuscript, the authors introduce a comprehensive Stereo-seq spatial transcriptomics analysis workflow, termed SAW. This workflow encompasses mRNA spatial position reconstruction, genome alignment, gene expression matrix generation, and clustering, culminating in the production of universally formatted results files for subsequent personalized analysis. SAW is particularly optimized for large field Stereo-seq spatial transcriptomics.
  
  The authors provide an in-depth elucidation of SAW's workflow and the optimization techniques employed for each module. However, several aspects warrant further discussion:
  
  The authors outline a strategy to reduce memory consumption during the mapping of CID tagged reads to corresponding coordinates by partitioning the mask file and fastq files. The manuscript, however, lacks a detailed description of how these files are divided. It would be beneficial if the authors could furnish additional information regarding this partitioning method.
  
  The gene expression matrix, a crucial output of the SAW process, lacks sufficient evaluation to substantiate its accuracy. The count tool generates this matrix through three primary steps: gene region annotation, MID correction, and MID deduplication. During the gene annotation phase, a hard threshold (50% of the read overlapping with exon) is used to determine if a read is exonic. The basis for this threshold, however, remains unclear.
  
  In the testing section, the authors evaluated the workflow on 2 S1 chips with approximately 1 million reads. The optimized workflow demonstrated a 1.8-fold speed increase compared to the non-optimized version. Table 2 only presents the total runtime before and after optimization. It would be advantageous if the authors could enrich this table by including the runtime of critical modules, such as read mapping, which accounts for 70% of the total runtime.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.08.20.554064v1
www.biorxiv.org www.biorxiv.org

Generating single-cell gene expression profiles for high-resolution spatial transcriptomics based on cell boundary images

2
1. GigaScience 26 Feb 2024
  
  in GigaByte
  
  ABSTRACTStereo-seq is a cutting-edge technique for spatially resolved transcriptomics that combines subcellular resolution with centimeter-level field-of-view, serving as a technical foundation for analyzing large tissues at the single-cell level. Our previous work presents the first one-stop software that utilizes cell nuclei staining images and statistical methods to generate high-confidence single-cell spatial gene expression profiles for Stereo-seq data. With recent advancements in Stereo-seq technology, it is possible to acquire cell boundary information, such as cell membrane/wall staining images. To take advantage of this progress, we update our software to a new version, named STCellbin, which utilizes the cell nuclei staining images as a bridge to align cell membrane/wall staining images with spatial gene expression maps. By employing an advanced cell segmentation technique, accurate cell boundaries can be obtained, leading to more reliable single-cell spatial gene expression profiles. Experimental results verify that STCellbin can be applied on the mouse liver (cell membranes) and Arabidopsis seed (cell walls) datasets and outperforms other competitive methods. The improved capability of capturing single cell gene expression profiles by this update results in a deeper understanding of the contribution of single cell phenotypes to tissue biology.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.110) as part of our Spatial Omics Methods and Applications series (https://doi.org/10.46471/GIGABYTE_SERIES_0005), and has published the reviews under the same license as follows:
  
  Reviewer 1. Chunquan Li
  
  Stereo-seq, an advanced spatial transcriptomics technique, allows detailed analysis of large tissues at the single-cell level with precise subcellular resolution. Author's prior software was groundbreaking, generating robust single-cell spatial gene expression profiles using cell nuclei staining images and statistical methods. They've enhanced their software to STCellbin, using cell nuclei images to align cell membrane/wall staining images. This update employs improved cell segmentation, ensuring accurate boundaries and more dependable single-cell spatial gene expression profiles. Successful tests on mouse liver and Arabidopsis seed datasets demonstrate STCellbin's effectiveness, enabling a deeper insight into the role of single-cell characteristics in tissue biology. However, I do have some suggestions and questions about certain parts of the manuscript. 1. The authors should show the advantages and performance of STCellbin compared to other methods, such as its computational efficiency, accuracy, and suitability for various image types. 2. To comprehensively assess the performance of STCellbin, the authors should consider integrating other commonly used cell segmentation evaluation metrics, such as F1-score, Dice coefficient, and so forth. 3. To ensure the completeness and reproducibility of the data analysis, more detailed information regarding the clustering analysis of the single-cell spatial gene expression maps generated through STCellbin is requested. This information should encompass methods, parameters, and results such as cluster type annotations. 4. The authors can use simpler and clearer language and terminology to describe the image registration process in the methods section, ensuring that readers can easily understand the workflow and principles of image registration.
  
  Reviewer 2. Zhaowei Wang
  
  In this manuscript, the authors propose STCellbin to generate single-cell gene expression profiles for high-resolution spatial transcriptomics based on cell boundary images. The experiment results on mouse liver and Arabidopsis seed datasets prove the good performance of STCellbin. The topic is significant and the proposed method is feasible. However, there are still some concerns and problems to be improved and clarified.
  
  (1) STCellbin is an update version of StereoCell, but the explanation of StereoCell is not sufficient. The authors should give a more detailed explanation of StereoCell, such as its input and main process. (2) The authors list some existing dyeing methods in Lines 52-53, Page 3. They should clarify that these methods are used for nuclei staining, which differentiate them from the cell membrane/wall staining methods of following content. It can provide a more accurate explanation for readers and users. (3) The authors share the GitHub repository of STCellbin, and I noticed that when executing STCellbin, the input only requires the path of image data and spatial gene expression data, the path of the output results, and the chip number. Are there other adjustable parameters? (4) In Page 5, Line 85, “steps” should be “step”, and in Page 8, Line 145, “must” would be better revised to “should”. Moreover, the writing of “stained image” and “staining image” should be consistent.
2. GigaScience 26 Feb 2024
  
  in GigaByte
  
  Editors Assessment:
  
  This paper describes a new spatial transcriptomics method that that utilizes cell nuclei staining images and statistical methods to generate high-confidence single-cell spatial gene expression profiles for Stereo-seq data. STCellbin is an update of StereoCell, now using a more advanced cell segmentation technique, so more accurate cell boundaries can be obtained, allowing more reliable single-cell spatial gene expression profiles to be obtained. After peer review more comparisons were added and more description given on what was upgraded in this version to convince the reviewers. Demonstrating it is a more reliable method, particularly for analyzing high-resolution and large-field-of-view spatial transcriptomic data. And extending the capability to automatically process Stereo-seq cell membrane/wall staining images for identifying cell boundaries.
  
  This evaluation refers to version 2 of the preprint
  
  Summary
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.12.25.573324v2
www.biorxiv.org www.biorxiv.org

BatchEval Pipeline: Batch Effect Evaluation Workflow for Multiple Datasets Joint Analysis

2
1. GigaScience 26 Feb 2024
  
  in GigaByte
  
  Editors Assessment:
  
  For better data quality assessment of large spatial transcriptomics datasets this new BatchEval software has been developed as a batch effect evaluation tool. This generates a comprehensive report with assessment findings, including basic information of integrated datasets, a batch effect score, and recommended methods for removing batch effects. The report also includes evaluation details for the raw dataset and results from batch effect removal methods. Through peer review and clarification of a number of points it now looks convincing that this tool helps researchers identify and remove batch effects, ensuring reliable and meaningful insights from integrated datasets. Potentially making the tool valuable for researchers who need to analyze large datasets of this type, as it provides an easy and reliable way to assess data quality and ensures that downstream analyses are robust and reliable.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 26 Feb 2024
  
  in GigaByte
  
  ABSTRACTAs genomic sequencing technology continues to advance, it becomes increasingly important to perform joint analyses of multiple datasets of transcriptomics. However, batch effect presents challenges for dataset integration, such as sequencing data measured on different platforms, and datasets collected at different times. Here, we report the development of BatchEval Pipeline, a batch effect workflow used to evaluate batch effect on dataset integration. The BatchEval Pipeline generates a comprehensive report, which consists of a series of HTML pages for assessment findings, including a main page, a raw dataset evaluation page, and several built-in methods evaluation pages. The main page exhibits basic information of the integrated datasets, a comprehensive score of batch effect, and the most recommended method for removing batch effect from the current datasets. The remaining pages exhibit evaluation details for the raw dataset, and evaluation results from the built-in batch effect removal methods after removing batch effect. This comprehensive report enables researchers to accurately identify and remove batch effects, resulting in more reliable and meaningful biological insights from integrated datasets. In summary, the BatchEval Pipeline represents a significant advancement in batch effect evaluation, and is a valuable tool to improve the accuracy and reliability of the experimental results.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.108) as part of our Spatial Omics Methods and Applications series (https://doi.org/10.46471/GIGABYTE_SERIES_0005), and has published the reviews under the same license as follows:
  
  **Reviewer 1. Chunquan Li **
  
  Page 1, Lines 14-16. The authors indicate that “it is crucial to thoroughly investigate the batch effects in the dataset before integrating and processing the data”. The term “thoroughly” may be not accurate enough. The current method can alleviate the batch effects, but it can’t thoroughly solve the related problems. In addition, this work proposes a batch evaluation tool, such “reasonably evaluate the batch effects” may be more accurate than “thoroughly investigate the batch effects”.
  
  In Figure 1, does the first box is “integrated datasets”?
  
  Page 5, Line 168, and Page 6, Lines 169-175, the content of these two paragraphs is similar, with some redundant descriptions. It is recommended to organize and write them into one paragraph.
  
  There is Table 1 in the table list, but Table 1 is missing in the main text.
  
  Page 8, Discussion section, it is better to discuss the differences between the proposed tool and a similar tool “batchQC”, especially the advantages of the proposed tool.
  
  Some other minor issues: Page 1, Line 22, “to do so” should be “to do it”. Page 3, Line 100, Ref. [13] should be cited when it first appears on Line 97. Page 4, Line 114 and Page 5, Line 146, “UMAP” should be given its full name when it first appears and abbreviated directly in the following text. The variable should be in italics, such as “p” on Page 4, Line 119, “H” on Page 6, Line 184.
  
  Reviewer 2. W. Evan Johnson and Howard Fan
  
  Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code?
  
  Yes. However, the code could use substantial improvements.
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?
  
  No. The manuscript is missing a section describing the software and its implementation.
  
  Is there enough clear information in the documentation to install, run and test this tool, including information on where to seek help if required?
  
  Yes. But it took a while to get it installed.
  
  Have any claims of performance been sufficiently tested and compared to other commonly-used packages?
  
  No. I think the most glaring deficiency in the paper is the lack of comparison with other methods. For example, there is no comparison of the tools available in BatchEval compared to other methods, such as BatchQC. Also, they mention that BatchQC might not work on larger datasets, but they perform no performance evaluation for BatchEval, and no comparison with BatchQC to demonstrate improved performance.
  
  Are there (ideally real world) examples demonstrating use of the software?
  
  Yes. Missed opportunity--I think the most exciting thing I observed from the paper was that the example data were from spatial transcriptomics data! To my knowledge, existing batch effect methods are not directly adapted to manage these data (although they did mention tools like BatchQC cannot handle large datasets, which may be true). But they don’t mention anything about batch adjustment/evaluation in spatial data in the manuscript. I feel that if the authors address this niche it would increase the value/impact of their work!
  
  Additional Comments:
  
  This review was conducted and written by Evan Johnson, who developed the competing BatchQC software.
  
  The authors provide an interesting toolkit for assessing batch effects in genomics data. The paper was clear and well-written, albeit I had a few concerns (see below). We were also able to download the associated software and test it out (comments below as well).
  
  I think the most exciting thing I observed from the paper was that the example data were from spatial transcriptomics data! To my knowledge, existing batch effect methods are not directly adapted to manage these data (although they did mention tools like BatchQC cannot handle large datasets, which may be true). But they don’t mention anything about batch adjustment/evaluation in spatial data in the manuscript. I feel that if the authors address this niche it would increase the value/impact of their work!
  
  In addition, this toolkit is written in Python, while BatchQC and other tools are written in R, so this is an advantage of the method as well—it addresses an audience that uses Python for gene expression analysis (not as big as the R community, but substantial). Their Python toolkit might also be more accessible to implementation in a pipeline workflow (for a core or large project) than R-based tools like BatchQC—this might be important to mention this as well.
  
  I think the most glaring deficiency in the paper is the lack of comparison with other methods. For example, there is no comparison of the tools available in BatchEval compared to other methods, such as BatchQC. Also, they mention that BatchQC might not work on larger datasets, but they perform no performance evaluation for BatchEval, and no comparison with BatchQC to demonstrate improved performance.
  
  Similarly, the authors claim: “Manimaran [10] has developed user-friendly software for evaluating batch effects. However, the software does not take into account nonlinear batch effects and may not be able to provide objective conclusions.” I don’t understand what the authors mean by “may not be able to provide objective conclusions” – BatchQC provides – several visual and numerical evaluations of batch effect – more so than even the proposed BatchEval does. Did the authors mean something else, maybe that the lack of non-linear correction may lead to less accurate conclusions?
  
  A related concern: does BatchEval provide non-linear adjustments? I may have missed this, but it seems that BatchEval is not providing non-linear adjustments either. Also, regarding non-linear adjustments, the authors should show in an example the problems with a lack non-linear adjustments and show that pre-transforming the data before using BatchQC does not perform as well as the non-linear BatchEval adjustments.
  
  In Equation 10, should “batchScore” be BatchEvalScore?
  
  Also, in the bottom of Figure on page 15, should the “BatchQCScore” also be BatchEvalScore??
  
  The manuscript is missing a section describing the software and its implementation.
  
  I asked my research scientist, who recently graduated with his PhD in Bioinformatics, to assess the software and examples. First of all, much of the software is named “BatchQC”. I think this is confusing, since the method is really named BatchEval and it will be confused with BatchQC which is another existing/competing software. Furthmore, it took him a significant effort to install the BatchEval software and get is working on our cluster. I would recommend the authors make their software more accessible and easier to install.
  
  The output of the software was a nice .html report diagnosing the batch effects in the data—very useful (attached is a combined .pdfs of the .htmls that we generated). We were also able to generate a report for the harmony adjusted example using their code. One major disadvantage was that these reports are separate files, and this could get very complicated comparing cases using multiple batch effect methods that will all be in separate reports (refer to a recent single cell batch comparison that compared more than a dozen methods – Tran et al. Genome Biology, 2020 – it would be hard to use BatchEval for this comparison).
  
  Also, it seems that the user is required to conduct the batch correction themselves, BatchEval does not help with the correction except for their example code for Harmony.
  
  Finally, on comparing the raw and Harmony adjusted datasets, inspection of the visual assessments (e.g. PCA) show some improvement—although not a perfect correction. But must of the numerical assessments are still the sample. The BatchEvalScore in both cases leads to the conclusion “Need to do batch effect removal”. What’s missing is the difference or improvement that Harmony makes on its correction. Maybe this is just because Harmony doesn’t fully remove the batch effects? Or is there something not working in the code? Might be good to see another example where the batch effect correction improves the BatchEvalScore significantly.
  
  Additional Files: https://gigabyte-review.rivervalleytechnologies.com/journal/gx/download-files?YXJ0aWNsZT00NDImZmlsZT0xNzEmdHlwZT1nZW5lcmljJnZpZXc9dHJ1ZQ~~
  
  Re-review:
  
  I find this paper to be much improved in this version. The authors have clearly worked hard to address my concerns and have addressed them in a satisfactory manner. I fully support the publication of this paper, and I believe their tools are a nice addition to the field.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.10.08.561465v2
academic.oup.com academic.oup.com

On the benefits of self-taught learning for brain decoding

3
1. GigaScience 26 Feb 2024
  
  in Public
  
  more sophisticated models [4–11].
  
  Reference 6 has been retracted due to potential manipulation of the publication process. The publisher of this paper cannot vouch for its reliability, but in this case this citation does not change the conclusions of the work published here. Though we thought we would highlight this to let readers know.
2. GigaScience 26 Feb 2024
  
  in Public
  
  Qureshi MB, Azad L, Qureshi MS, et al. Brain decoding using fMRI images for multiple subjects through deep learning. Comput Math Methods Med. 2022;2022:1–10.
  
  Reference 6 has been retracted due to potential manipulation of the publication process. The publisher of this paper cannot vouch for its reliability, but in this case this citation does not change the conclusions of the work published here. Though we thought we would highlight this to let readers know.
3. GigaScience 26 Feb 2024
  
  in Gigascience Annotations
  
  [4–11].
  
  Reference 6 has been retracted due to potential manipulation of the publication process. The publisher of this paper cannot vouch for its reliability, but in this case this citation does not change the conclusions of the work published here. Though we thought we would highlight this to let readers know.
Visit annotations in context

Annotators

GigaScience

URL

academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad029/7150396
www.biorxiv.org www.biorxiv.org

spatiAlign: An Unsupervised Contrastive Learning Model for Data Integration of Spatially Resolved Transcriptomics

2
1. GigaScience 21 Feb 2024
  
  in GigaScience
  
  AbstractIntegrative analysis of spatially resolved transcriptomics datasets empowers a deeper understanding of complex biological systems. However, integrating multiple tissue sections presents challenges for batch effect removal, particularly when the sections are measured by various technologies or collected at different times. Here, we propose spatiAlign, an unsupervised contrastive learning model that employs the expression of all measured genes and the spatial location of cells, to integrate multiple tissue sections. It enables the joint downstream analysis of multiple datasets not only in low-dimensional embeddings but also in the reconstructed full expression space. In benchmarking analysis, spatiAlign outperforms state-of-the-art methods in learning joint and discriminative representations for tissue sections, each potentially characterized by complex batch effects or distinct biological characteristics. Furthermore, we demonstrate the benefits of spatiAlign for the integrative analysis of time-series brain sections, including spatial clustering, differential expression analysis, and particularly trajectory inference that requires a corrected gene expression matrix.Competing Interest Statement
  
  Reviewer 2. Stefano Monti
  
  The manuscript addresses the very challenging problem of integrating multiple spatially resolved transcriptomics datasets, and proposes a novel algorithm based on multiple deep learning techniques, including DNN encoders, and self supervised and contrastive learning. Evaluation on several datasets is presented alongside comparison to multiple existing methods using several integration metrics (LISI, ARI, etc.). The presented method appears to outperform existing methods according to multiple criteria, and thus it represents a significant contribution to the field worth publishing.
  
  The write-up is adequate, although the description of the method very "abstract", and it would benefit from more specificity in describing the inputs and outputs of each step, how some of the models are shared (e.g., is the DNN encoder shared only across sections/samples or also across the original (Fig 1C, top) and perturbed (Fig 1C, bottom) inputs? Likewise for the Graph Encoder), and the intuition behind each of the steps included.
  
  Some specific comments: * It would be helpful if the results sections describing each of the applications (DLPFC datasets, Olfactory bulb datasets, etc.) were more detailed in the description of the datasets to be combined. What are the inputs (how many samples, are sections the same as samples?, how many slices per sample, etc). * Unless I'm mistaken, the labeling of Fig S1 is wrong. I think fig S1a is the UMap and S1b is the "manual annotation" rather than the other way around?
2. GigaScience 21 Feb 2024
  
  in GigaScience
  
  AbstractIntegrative analysis of spatially resolved transcriptomics datasets empowers a deeper understanding of complex biological systems. However, integrating multiple tissue sections presents challenges for batch effect removal, particularly when the sections are measured by various technologies or collected at different times. Here, we propose spatiAlign, an unsupervised contrastive learning model that employs the expression of all measured genes and the spatial location of cells, to integrate multiple tissue sections. It enables the joint downstream analysis of multiple datasets not only in low-dimensional embeddings but also in the reconstructed full expression space. In benchmarking analysis, spatiAlign outperforms state-of-the-art methods in learning joint and discriminative representations for tissue sections, each potentially characterized by complex batch effects or distinct biological characteristics. Furthermore, we demonstrate the benefits of spatiAlign for the integrative analysis of time-series brain sections, including spatial clustering, differential expression analysis, and particularly trajectory inference that requires a corrected gene expression matrix.
  
  Reviewer 1. Lamda Moses.
  
  This papers presents spatiAlign, a package that batch corrects spatial transcriptomics data and performs spatially informed clustering. Spatial information is incorporated in the graph layers in the variational graph autoencoder which performs dimension reduction, and in the reduced dimensional space, self-supervised contrastive learning is used to batch correct and to assign cells/spots to clusters. The autoencoder then reconstructs a batch corrected gene count matrix for downstream use with methods that require a full gene count matrix. The method seems reasonable for this task and is well-described, more intuitively in the Results section and in more details in the Methods section.
  
  Then spatiAlign is benchmarked against several popular and state of the art methods for batch correction, including two recently published methods that use spatial information (GraphST and PRECAST) and several not using spatial information but commonly used (e.g. Seurat, Harmony, COMBAT). The choice of existing methods to benchmark is fair. The LISI F1 score is a reasonable metric to quantify performance in both batch correction and cluster separation when the spatial clusters in the brain datasets used in benchmarking are already annotated. The iLISI (batch correction) and cLISI (cluster separation), analogous to precision and recall in the original F1, are shown separately in the supplement. The F1 score is around 0.8 for spatiAlign, which is pretty good. When there is no a priori annotation, the iLISI is used to quantify how well different batches mix and Moran's I is used to indicate spatial coherence of the clusters, which are then validated with differential expression. spatiAlign is also demonstrated to integrate data from different technologies—Stereo-seq and Visium—which have different spatial resolutions. Finally, spatiAlign is demonstrated on the developing mouse brain integrating data across multiple time points.
  
  The language of this paper is good and does not require extensive editing for clarity. The spatiAlign package can be installed with pip and has a minimal tutorial on the documentation website.
  
  Overall, I find this paper well-written and a valuable contribution to this field. There are many methods that perform batch correction without using spatial information, and several that align different tissue sections, some using transcriptome information, but without correcting for batch effect in the transcriptomes. Not all methods that take spatial information into account give a batch corrected full gene count matrix as an output. The metrics reasonably demonstrate superior performance of spatiAlign compared to other methods benchmarked on the datasets used.
  
  Below are my questions and comments that may improve this paper:
  
  All the benchmarking datasets are from the brain, though different parts of the brain, from human and mouse, with different morphologies. The brain has a stereotypical structure. As spatiAlign uses the spatial neighborhood graph rather than the original coordinates, can it be applied to tissues without such stereotypical structure, such as tumors, skeletal muscle, colon, liver, lung, and adipose tissue? Benchmarking on a dataset from a tissue without a stereotypical structure would make a stronger case, to be more representative of the full breadth of spatial transcriptomics datasets.
  
  Biological variability is mentioned, such as from different regions of hippocampus and different stages of development. Many studies have a disease or experiment group and a control group, often with multiple subjects in each group. There are biological differences among the subjects and technical batch effects between sections, but the differences between case and control are of interest, so we have different kinds of batches. Benchmarking on a case/control study would be really helpful. How well does spatiAlign preserve biological differences between case and control while correcting for technical batch effects?
  
  The Methods section says, "Inspired by unsupervised contrastive clustering[32], we map each spot/cell i into an embedding space with d dimensions, where d is equal to the number of pseudoprototypical clusters." In Tutorial 2 on the documentation website, the latent dimension is set to be 100. Why is this number chosen? Can you clarity how to choose the number of latent dimensions? How does this affect downstream results?
  
  Since you use the k nearest neighbor graph when constructing the spatial neighborhood graph that feeds into the variational graph autoencoder, what are the reasons why k=15 is chosen? Should it be different for array-based technologies such as Visium and Stereo-seq and imaging-based technologies with single cell resolution such as MERFISH? Furthermore, due to different spatial resolutions, the spatial neighborhood graph has different biological meanings for Visium and MERFISH.
  
  All the benchmarking datasets are from array-based technologies: Visium, Slide-seq, and Stereo-seq. Imaging-based technologies are getting commercialized and getting more widely adopted, especially MERFISH and Molecular Cartography. It would be great if you benchmark using an imaging-based dataset and perhaps integrate an imaging-based and an array-based dataset, to be more representative of the full breadth of spatial transcriptomics technologies. This should also take into consideration that imaging-based datasets typically only profile a few hundred genes while array-based datasets are transcriptome-wide. This might be too much for this paper, but should at least be mentioned in the Discussions section.
  
  Is the code used to reproduce the figures available?
  
  Generally, the y axes of bar charts for F1 scores, ARI, normalized iLISI, and normalized cLISI are really confusing when they don't start at 0 and end at 1. This exaggerates how much better spatiAlign performs compared to other methods when the other methods aren't that much worse based on the numbers, such as in Figure 2c.
  
  In Supplementary Figure S4b, do you actually mean 1 - cLISI? If a smaller cLISI is better, then spatiAlign performs the worst in this case, and should have a low F1 score in Figure 2c.
  
  It would be helpful to include a computational time and memory usage benchmark.
  
  The join count statistic is a spatial autocorrelation statistic designed for binary data, and may thus be more appropriate than Moran's I to indicate spatial coherence of clusters, although Moran's I does convey the message of spatial coherence here.
  
  The documentation website can be improved by making a description of all parameters of the functions available, to explain what each parameter means and what kind of input and output is expected.
  
  It would be helpful to include preprocessing in the tutorial on the documentation website. Do we need to log normalize the data first and why? Does the data need to be scaled?
  
  Below are minor technical comments: 1. The notation for the LISI F1 score in the Methods sections is very confusing. Based on context and the definition of the F1 score, you probably meant to put parentheses around 1 - cLISInorm . 2. Typo in "SCAlEX" in Supplementary Figure S5a; you seem to mean "SCALEX". It's more aesthetically pleasing to be consistent in capitalizing according to the original names of the packages in Supplementary Figure S5.
  
  Re-review
  
  For the most part, the authors have satisfactorily addressed concerns raised by the reviewers. Below are my followup comments on the revised manuscript: 1. The authors missed the point of my second comment on case/control studies. What I was asking for is performance of spatiAlign and other related packages when integrating case datasets and control datasets while preserving biological differences of interest to the study. For example, data from healthy liver (control) and hepatic steatosis (case) are integrated. Case and control samples were collected from different patients and may be mounted on different slides. How well does spatiAlign preserve differences between healthy and steatosis, while correcting for technical batch effect? In Figure S7, the two sub-slices are still from the same disease condition. Case/control studies should at least be mentioned in the Discussions section. 2. The authors have provided thoughtful explanations on data scaling, number of latent dimensions, and number of neighbors in the k nearest neighbor graph in the response to reviewers. However, these explanations are not found in the manuscript or on the documentation website. Because these explanations are very relevant to users, it would be helpful to add them to either the manuscript or the documentation website. 3. For the bar charts, I suggest assigning a fixed color to each data integration method and keeping it consistent throughout this study. Right now the bar charts don't have a consistent color scheme even within the same figure. Keeping a consistent color scheme can reduce the mental burden of readers since the colors are a stand-in for the different methods. Also, a colorblind-friendly palette should be used. 4. I agree with Reviewer 3 that the grammar in this paper should be improved. For example, in lines 75-76, "in which gene expression is adjustment" should be "in which gene expression is adjusted". In lines 82-83, the "adjusted" in "laminar organization with adjusted, and clear boundaries between regions" does not make sense given the context referring to Figure 2f. In line 332, "the benchmarking methods" should be "the benchmarked methods", because the methods are being benchmarked and the methods themselves are not meant for benchmarking. Grammar in the newly added section from line 344 onwards should be corrected.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.08.08.552402v2
www.biorxiv.org www.biorxiv.org

Near chromosome-level and highly repetitive genome assembly of the snake pipefish Entelurus aequoreus (Syngnathiformes: Syngnathidae)

2
1. GigaScience 14 Feb 2024
  
  in GigaByte
  
  Editors Assessment:
  
  The snake pipefish, Entelurus aequoreus, is a species of fish that dwells in open seagrass habitats in the northern Atlantic. As a pipefish, it is a member of the Syngnathidae family of fish which also includes seahorses and seadragons. In recent years it has expanded its population size and range into arctic waters. To better understand these demographic changes genomic data is useful, and to address this a high-quality reference genome has been produced. Building on a previous short-read reference, a near chromosome-scale genome assembly for the snake pipefish was assembled using PacBio CLR and Hi-C reads. After revisions the authors provided more details on the assembly metrics, the final assembly has a length of 1.6 Gbp, with scaffold and contig N50s of 62.3 Mbp and 45.0 Mbp respectively. Demographic inference analysis of the snake pipefish genome using this data enables tracing of population changes over the past 1 million years, and this reference will allow further analyses and studies relating these to changes in climate.
  
  **This evaluation refers to version 1 of the preprint *
  
  Summary
2. GigaScience 14 Feb 2024
  
  in GigaByte
  
  AbstractThe snake pipefish, Entelurus aequoreus (Linnaeus, 1758), is a slender, up to 60 cm long, northern Atlantic fish that dwells in open seagrass habitats and has recently expanded its distribution range. The snake pipefish is part of the family Syngnathidae (seahorses and pipefish) that has undergone several characteristic morphological changes, such as loss of pelvic fins and elongated snout. Here, we present a highly contiguous, near chromosome-scale genome of the snake pipefish assembled as part of a university master’s course. The final assembly has a length of 1.6 Gbp in 7,391 scaffolds, a scaffold and contig N50 of 62.3 Mbp and 45.0 Mbp and L50 of 12 and 14, respectively. The largest 28 scaffolds (>21 Mbp) span 89.7% of the assembly length. A BUSCO completeness score of 94.1% and a mapping rate above 98% suggest a high assembly completeness. Repetitive elements cover 74.93% of the genome, one of the highest proportions so far identified in vertebrate genomes. Demographic modeling using the PSMC framework indicates a peak in effective population size (50 – 100 kya) during the last interglacial period and suggests that the species might largely benefit from warmer water conditions, as seen today. Our updated snake pipefish assembly forms an important foundation for further analysis of the morphological and molecular changes unique to the family Syngnathidae.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.105), and has published the reviews under the same license as follows:
  
  Reviewer 1. Yanhong Zhang
  
  Are all data available and do they match the descriptions in the paper? No. There is no BioProject available for review at the link. Are the data and metadata consistent with relevant minimum information or reporting standards?
  
  No. "the GigaDB repository:DOI:XXXXX." I am not sure that the authors have upload the data.
  
  Is the data acquisition clear, complete and methodologically sound? No. I am not sure that the authors have uploaded the data.
  
  Is there sufficient data validation and statistical analyses of data quality? No. I need more information.
  
  Is the validation suitable for this type of data? No. I need more information.
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data? No. I need more information.
  
  Any Additional Overall Comments to the Author:
  
  In line 41, you mean “50-100 kya”?
  
  The authors need to provide more details about the genomic data： Genome size estimation based on K-mer spectrum？ Statistics of genomic characteristics from K-mer? Statistics of Hi-C sequencing raw data, such as raw bases, clean bases. Statistics of the pseduchromosome assemblies using Hi-C data. The result of BUSCO assessment, how about complete BUSCOs? complete single-copy？ Statistics of gene predictions in the snake pipefish Statistics of the noncoding RNA in the snake pipefish genome. The author claims that all other data, including the repeat and gene annotation, was uploaded to the GigaDB repository: DOI: XXXXX. I cannot find these data. “DOI: XXXXX”? What does that mean?
  
  Reviewer 2. Sarah Flanagan
  
  Are all data available and do they match the descriptions in the paper?
  
  No. I received an NCBI link which took me to the raw data files and a BioSample description, but it did not link to the assembled and annotated genome.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  Yes. Only one point was not clear to me in the methods -- please clarify in the text which data was used to generate consensus genome sequences using vcfutils (lines 240-241). How did this differ from the assembled and annotated genome?
  
  Any Additional Overall Comments to the Author:
  
  In the abstract and introduction, the description of the habitat of the species is confusing and it was not clear from the manuscript as written that there are two ecotypes, one that is pelagic and one that is coastal. Consider re-phrasing these sections (lines 31-32, 57-59, and 61-62) to better describe the habitat of this species.
  
  Please also consider increasing the font size of the labels in Figure 1 -- the details are very difficult to read.
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.12.12.571260v1
www.biorxiv.org www.biorxiv.org

Species Composition and Distribution of Anopheles gambiae Complex Circulating in Kinshasa

2
1. GigaScience 14 Feb 2024
  
  in GigaByte
  
  Editors Assessment: Understanding the distribution of Anopheles mosquito species is essential for planning and implementing malaria control programmes, a task undertaken in this study that assesses the composition and distribution of the Anopheles in different districts of Kinshasa in the Democratic Republic of Congo. Mosquitoes were collected using CDC light traps, and then identified by morphological and molecular means. In total 3,839 Anopheles were collected, and data was digitised, validated and shared via the GBIF database under a CC0 waiver. The project monitoring the monthly dynamics of four species of Anopheles, showing a fluctuation in their respective frequencies during the study period. Review improved the metadata by adding more accurate date information, and this data can provide important information for further basic and advanced studies on the ecology and phenology of these vectors in West Africa.
  
  *This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 14 Feb 2024
  
  in GigaByte
  
  AbstractUnderstanding the distribution of Anopheles species in a region is an important task in the planning and implementation of malaria control programmes. This study was proposed to evaluate the composition and distribution of cryptic species of the main malaria vector, Anopheles gambiae complex, circulating in different districts of Kinshasa.To study the distribution of members of the An. gambiae complex, Anopheles were sampled by CDC light trap and larva collection across the four districts of Kinshasa city between July 2021 and June 2022. After morphological identification, an equal proportion of Anopheles gambiae s.l. sampled per site were subjected to polymerase chain reaction (PCR) for identification of cryptic An. gambiae complex species.The Anopheles gambiae complex was widely identified in all sites across the city of Kinshasa, with a significant difference in mean density, captured by CDC light, inside and outside households in Kinshasa (p=0.002). Two species of this complex circulate in Kinshasa: Anopheles gambiae (82.1%) and Anopheles coluzzii (17.9%). In all study sites, Anopheles gambiae was the most prevalent species. Anopheles coluzzii was very prevalent in Tshangu district. No hybrids (Anopheles coluzzii/Anopheles gambiae) were identified.Two cryptic species of the Anopheles gambiae complex circulate in Kinshasa. Anopheles gambiae s.s., present in all districts and Anopheles coluzzii, with a limited distribution. Studies on the ecology of the larval sites are essential to better understand the factors influencing the distribution of members of the An. gambiae complex in this megalopolis.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.104), and has published the reviews under the same license. This is part of the GigaByte Vectors of Human Disease series, and this and the other papers are hosted here. https://doi.org/10.46471/GIGABYTE_SERIES_0002
  
  The peer reviews are as follows.
  
  Reviewer 1. Paul Taconet
  
  Are all data available and do they match the descriptions in the paper? No
  
  Additional Comments 1/ The CDC light trap catch data are available in the GBIF release, but the larva collection data are not included in the release. These larva collection data should be either included in the GBIF release, or it should be made clear in the manuscript that this data is not published. 2/ in the dataset, the data are indicated to be reported at the species level (taxonRank = Species) but there are no An. coluzzii reported. However, in table 3 of the manuscript, some An. coluzzii are reported. This is inconsistent. My guess is that the data reported in the dataset are those out of the morphological identification, hence for An. gambiae at the COMPLEX level, and not the species. This shoud in any case be clarified and corrected : are the data in the dataset provided at the complex or at the species level ? If complex, the ScientificName and taxonRank columns should be corrected. In addition, in the dataset, you could add an "identificationRemarks" column providing the source of identification (morphological or molecular). 3/ in the dataset, for the species scientific name, I suggest to use the names as presented in : Harbach, R.E. 2013. Mosquito Taxonomic Inventory, https://mosquito-taxonomic-inventory.myspecies.info/ . Or at least, to provide the "nameAccordingTo" column. 4/ The data available are of type 'occurrence' ( only in 1 file - the "occurrence" file). For a better presentation of the data and in order to be in line with the GBIF data architecture, I would suggest to transform them into "sampling event" data (consisting in 1 'event core' file, 1 'occurence' file, and potentially extension files), which is more suited to this kind of data acquired from sampling events (see https://ipt.gbif.org/manual/en/ipt/latest/sampling-event-data) and containing external measurements (eg. temperature, see next point). This would enable the user to quickly understand the dates and locations of the sampling events. 5/ Temperature and humidity are included in the main 'occurence' file (column "dynamicProperties") : - to which reality these data correspond (mean during the night of collection ? ), and how were these data collected (instrument, etc.) ? this information is not provided in the manuscript. - Instead of putting this data in the "occurence" file, I would suggest to add a "measurement" file in the GBIF data release, containing these meteorological data. Doing so would enable to include metadata about these measurements (instrument, etc.) See e.g. https://www.gbif.org/sites/default/files/gbif_IPT-sample-data-primer_en.pdf page 6 6/ in the dataset, for some of the collected mosquitoes, you put "organismRemarks" = "unfed" . How did you collect this information ? I could not see any mention to this feeding identification, neither in the manuscript nor in the dataset. 7/ in the dataset, in the column "SamplingProtocol", there are spelling errors -> "CDC ligth trap cathes" should be corrected to "CDC light trap catches "
  
  Are the data and metadata consistent with relevant minimum information or reporting standards? No. See comments above.
  
  Is the data acquisition clear, complete and methodologically sound? No. See comments above
  
  Any Additional Overall Comments to the Author: Thanks for this nice work and the effort you put to open your data. See comments below and above to improve the work. 1/ comments for figure 1 (map) : the background layer is not very appropriate, as we miss landscape context. Maybe better to put an Open Street Map background layer, or a satellite image.
  
  Reviewer 2. Chris Hunter.
  
  Are all data available and do they match the descriptions in the paper?
  
  No. The larva data are not included in the GBIF dataset. Some of the descriptions of the data in the manuscript do not match the data available from GBIF. Any Additional Overall Comments to the Author:
  
  Major comments (Author action required): 1 - The manuscript describes larva collection and molecular identification of those species, but I cannot see any indication that those data are included in the GBIF dataset. Please clarify whether they are included or not, and if not please add them. 2 - The numbers cited in Table 1 do not match those shown in the GBIF dataset, e.g. the total of indoor/outdoor sampling events quoted in MS table 1 = 2180 / 1659, whereas in GBIF dataset there are 2304 indoor and 1535 outdoor sites listed? Please check your calculations and/or the data submitted to GBIF.
  
  Minor comments (Author action suggested): 1 - There are 59 events in the GBIF data that do not have a date. Please check those data and update if you have those dates available. 2 - The events are all included in the GBIF sampling event dataset, however “individualCount” data are not included, please explain why those counts are not included as observation dataset(s)? i.e. why is there no number of individual mosquitos included in the dataset? 3 - The full DwC-GBIF dataset does include an indication of the indoor/outdoor location of the sampling sites in the "eventRemark" column, but if you are making updates to the dataset may I suggest using the column heading “habitat” to include that information in GBIF either instead or as well. 4 - Ideally, the molecular identification data should be shared. I don’t have access to the “protocol of Scott [29]” but my assumption is that the PCR products are differentiated by size via running on a gel? If so, and you have the digital images of those gels please let the GigaByte editors know and they will help you share them via the GigaDB database.
  
  Please see the linked file "Data-Review-of-DRR-202310-03.pdf" for more details about the above concerns.
  
  https://gigabyte-review.rivervalleytechnologies.com/journal/gx/download-files?YXJ0aWNsZT00ODEmZmlsZT0xODMmdHlwZT1nZW5lcmljJnZpZXc9dHJ1ZQ~~
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.10.26.564181v1
www.biorxiv.org www.biorxiv.org

A Multi-omics Data Analysis Workflow Packaged as a FAIR Digital Object

3
1. GigaScience 11 Feb 2024
  
  in GigaScience
  
  Background Applying good data management and FAIR data principles (Findable, Accessible, Interoperable, and Reusable) in research projects can help disentangle knowledge discovery, study result reproducibility, and data reuse in future studies. Based on the concepts of the original FAIR principles for research data, FAIR principles for research software were recently proposed. FAIR Digital Objects enable discovery and reuse of Research Objects, including computational workflows for both humans and machines. Practical examples can help promote the adoption of FAIR practices for computational workflows in the research community. We developed a multi-omics data analysis workflow implementing FAIR practices to share it as a FAIR Digital Object.Findings We conducted a case study investigating shared patterns between multi-omics data and childhood externalizing behavior. The analysis workflow was implemented as a modular pipeline in the workflow manager Nextflow, including containers with software dependencies. We adhered to software development practices like version control, documentation, and licensing. Finally, the workflow was described with rich semantic metadata, packaged as a Research Object Crate, and shared via WorkflowHub.Conclusions Along with the packaged multi-omics data analysis workflow, we share our experiences adopting various FAIR practices and creating a FAIR Digital Object. We hope our experiences can help other researchers who develop omics data analysis workflows to turn FAIR principles into practice.
  
  Reviewer 3 Megan Hagenauer - Original Submission
  
  Review of "A Multi-omics Data Analysis Workflow Packaged as a FAIR Digital Object" by Niehues et al. for GigaScience08-31-2023I want to begin by apologizing for the tardiness of this review - my whole family caught Covid during the review period, and it has taken several weeks for us to be functional again.OverviewAs a genomics data analyst, I found this manuscript to be a fascinating, inspiring, and, quite honestly, intimidating, view into the process of making analysis code and workflow truly meet FAIR standards. I have added recommendations below for elements to add to the manuscript that would help myself and other analysts use your case study to plan out our own workflows and code release. These recommendations fall quite solidly into the "Minor Revision" category and may require some editorial oversight as this article type is new to me. Please note that I only had access to the main text of the manuscript while writing this review.Specific Comments1) As a case study, it would be useful to have more explicit discussion of the expertise and effort involved in the FAIR code release and the anticipated cost/benefit ratio:As a data analyst, I have a deep, vested interest in reproducible science and improved workflow/code reusability, but also a limited bandwidth. For me, your overview of the process of producing a FAIR code release was both inspiring and daunting, and left me with many questions about the feasibility of following in your footsteps. The value of your case study would be greatly enhanced by discussing cost/benefit in more detail:a. What sort of expertise or training was required to complete each step in the FAIR release? E.g.,i. Was your use of tools like Github, Jupyter notebook, WorkflowHub, and DockerHub something that could be completed by a scientist with introductory training in these tools, or did it require higher level use?ii. Was there any particular training required for the production of high quality user documentation or metadata? (e.g., navigating ontologies?)b. With this expertise/training in place, how much time and effort do you estimate that it took to complete each step of adapting your analysis workflow and code release to meet FAIR standards?i. Do you think this time and effort would differ if an analyst planned to meet FAIR standards for analysis code prior to initiating the analysis versus deciding post-hoc to make the release of previously created code fit FAIR standards?c. The introduction provides an excellent overview of the potential benefits of releasing FAIR analysis code/workflows. How did these benefits end up playing out within your specific case study?i. e.g., I thought this sentence in your discussion was a particularly important note about the benefits of FAIR analysis code in your study: "Developing workflows with partners across multiple institutions can pose a challenge and we experienced that a secure shared computing environment was key to the success of this project."ii. Has the FAIR analysis workflow also been useful for collaboration or training in your lab?iii. How many of the analysis modules (or other aspects of the pipeline) do you plan on reusing? In general, what do you think is the size for the audience for reuse of the FAIR code? (e.g., how many people do you think will have been saved significant amounts of work by you putting in this effort?)iv. … Or is the primary benefit mostly just improving the transparency/reproducibility of your science?d. If there is any way to easily overview these aspects of your case study (effort/time, expertise, immediate benefits) in a table or figure, that would be ideal. This is definitely the content that I would be skimming your paper to find.2) As a reusable code workflow, it would be useful to provide additional information about the data input and experimental design, so that readers can determine how easily the workflow could be adapted to their own datasets. This information could be added to the text or to Fig 1. E.g.,i. The dimensionality of the input (sample size, number of independent variables & potential co-variates, number of dependent variables in each dataset, etc)ii. Data types for the independent variables, co-variates, and dependent variables (e.g., categorical, numeric, etc)iii. Any collinearity between independent variables (e.g., nesting, confounding).3) As documentation of the analysis, it would be useful to provide additional information about how the analysis workflow may influence the interpretation of the results.a. It would be especially useful to know which aspects of the analysis were preplanned or following a standard procedure/protocol, and which aspects of the analysis were customized after reviewing the data or results. This information can help the reader assess the risk of overfitting or HARKing.b. It would also be useful to call out explicitly how certain analysis decisions change the interpretation of the results. In particular, the decision to use dimension reduction techniques within the analysis of both the independent and dependent variables, and then focus only on the top dimensions explaining the largest sources of variation within the datasets, is especially important to justify and describe its impact on the interpretation of the results. Is there reason to believe that externalizing behavior should be related to the largest sources of variation within buccal DNA methylation or urinary metabolites? Within genetic analyses, the assumption tends to be the opposite - that genetic variation related to behavior (such as externalizing) is likely to be present in a small percent of the genome, and that the top sources of variation within the genetics dataset are uninteresting (related to population) and therefore traditionally filtered out of the data prior to analysis. Within transcriptomics, if a tissue is involved in generating the behavior, some of the top dimensions explaining the largest sources of variation in the dataset may be related to that behavior, but the absolute largest sources of variation are almost always technical artifacts (e.g., processing batches, dissection batches) or impactful sources of biological noise (e.g., age, sex, cell type heterogeneity in the tissue). Is there reason to believe that cheek cells would have their main sources of epigenetic variation strongly related to externalizing behavior? (maybe as a canary in a coal mine for other whole organism events like developmental stress exposure?). Is there reason to believe that the primary variation in urinary metabolites would be related to externalizing behavior? (perhaps as a stand-in for other largescale organismal states that might be related to the behavior - hormonal states? metabolic states? inflammation?). Since the goal of this paper is to provide a case study for creating a FAIR data analysis workflow, it is less important that you have strong answers for these questions, and more important that you are transparent about how the answers to these questions change the interpretation of your results. Adding a few sentences to the discussion is probably sufficient to serve this purpose. Thank you for your hard work helping advance our field towards greater transparency and reproducibility. I look forward to seeing your paper published so that I can share it with the other analysts in our lab.
2. GigaScience 11 Feb 2024
  
  in GigaScience
  
  Background Applying good data management and FAIR data principles (Findable, Accessible, Interoperable, and Reusable) in research projects can help disentangle knowledge discovery, study result reproducibility, and data reuse in future studies. Based on the concepts of the original FAIR principles for research data, FAIR principles for research software were recently proposed. FAIR Digital Objects enable discovery and reuse of Research Objects, including computational workflows for both humans and machines. Practical examples can help promote the adoption of FAIR practices for computational workflows in the research community. We developed a multi-omics data analysis workflow implementing FAIR practices to share it as a FAIR Digital Object.Findings We conducted a case study investigating shared patterns between multi-omics data and childhood externalizing behavior. The analysis workflow was implemented as a modular pipeline in the workflow manager Nextflow, including containers with software dependencies. We adhered to software development practices like version control, documentation, and licensing. Finally, the workflow was described with rich semantic metadata, packaged as a Research Object Crate, and shared via WorkflowHub.Conclusions Along with the packaged multi-omics data analysis workflow, we share our experiences adopting various FAIR practices and creating a FAIR Digital Object. We hope our experiences can help other researchers who develop omics data analysis workflows to turn FAIR principles into practice.
  
  Reviewer 2 Dominique Batista - Original Submission
  
  Very good paper on the FAIR side. You detail what were the challenges, in particular when it comes to the selection of ontologies and terms.It is unclear if the generation of the ISA metadata is included in the workflow. Can a user generate the metadata for the synthetic dataset or their own data using the workflow ?Adding a GitHub action running the workflow with the synthetic data would help reusability but is not required for the publication of the paper.
3. GigaScience 11 Feb 2024
  
  in GigaScience
  
  Background Applying good data management and FAIR data principles (Findable, Accessible, Interoperable, and Reusable) in research projects can help disentangle knowledge discovery, study result reproducibility, and data reuse in future studies. Based on the concepts of the original FAIR principles for research data, FAIR principles for research software were recently proposed. FAIR Digital Objects enable discovery and reuse of Research Objects, including computational workflows for both humans and machines. Practical examples can help promote the adoption of FAIR practices for computational workflows in the research community. We developed a multi-omics data analysis workflow implementing FAIR practices to share it as a FAIR Digital Object.Findings We conducted a case study investigating shared patterns between multi-omics data and childhood externalizing behavior. The analysis workflow was implemented as a modular pipeline in the workflow manager Nextflow, including containers with software dependencies. We adhered to software development practices like version control, documentation, and licensing. Finally, the workflow was described with rich semantic metadata, packaged as a Research Object Crate, and shared via WorkflowHub.Conclusions Along with the packaged multi-omics data analysis workflow, we share our experiences adopting various FAIR practices and creating a FAIR Digital Object. We hope our experiences can help other researchers who develop omics data analysis workflows to turn FAIR principles into practice.
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad115), and has published the reviews under the same license. These are as follows.
  
  Reviewer 1 Carole Goble - Original Submission
  
  This work reports a multi-omics data analysis workflow packaged as a RO-Crate, an implementation of a FAIR Digital Object.We limit our comments to the technical aspects of the Research Object and workflow packaging. The scientific validity of the omics analysis itself is outside our expertise.The paper is comprehensive and the background grounding in the current state of the art is excellent and thorough. The paper is an excellent exemplar of the future of data analysis reporting for FAIR and reproducible computational methods, and the amount of work impressive. We congratulate the authors.WorkflowHub entry https://workflowhub.eu/workflows/402?version=5# gives a comprehensive report of the Nextflow workflow and its multiple versions, all the files including the R scripts and the synthetic data. The RO-Crate rendering looks correct and version-locking the R containers is following best practice(https://github.com/Xomics/ACTIONdemonstrator_workflow/blob/main/nextflow.config#L44)T he paper also highlights the amount of work needed to make such a pipeline to be both metadata machine processable and metadata human readable.To make this pipeline reproducible requires a mixture of notebooks submitted as supplementary materials, the Nextflow workflow with its R scripts represented as an RO-Crate in WorkflowHub and a README is linked to the container recipes in https://github.com/Xomics/Docker_containers and then another Documentation.md file. There seems to be the potential for duplicated effort in reporting the necessary metadata describing the workflow that could be highlighted in the Discussion as a steer to the digital object community.- Could the ROCrate approach be widened beyond the current Workflow RO-Crate, and would there be value in streamlining the metadata, or is this just an artefact of the need for multiple descriptions and ease of publishing. If the JSON within the RO-Crate was more richly annotated, could some of the Documentation.md be avoided altogether, and is that even desirable?- The README includes the container/software packaging and is not linked from the RO-Crate (and there isn't an obvious property to link to it yet). Could these be RO-Crates too?- The notebooks in the supplementary files could also be registered in WorkflowHub and linked to the Nextflow workflow (see https://workflowhub.eu/workflows?filter%5Bworkflow_type%5D=jupyter).- Is it feasible and desirable to have a single RO-Crate linked to many other RO-Crates to represent the whole reproducible pipeline in full?In the discussion the FAIR principles verification through different practices and approaches would be more helpful if it was more precise. Comments seem to be limited to the Workflow RO-Crate and use of ontologies for machine readability. As highlighted in table 1 there is more to FAIR software & workflows than this.Minor remarksKey points- We here demonstrate the implementation multiomics data -> We here demonstrate an implementation of an multi-omics data.Background- The documentation of dependencies is highlighted as a prerequisite for software interoperability. In the FAIR4RS principles I2 also highlights qualified references to other objects - presumably other software or installation requirements. This highlights the relationship between software interoperability and software portability. It seems that dependencies more relate to portability rather than interoperability.- "Based on the FDO concept, the RO-Crate approach was specified". This is a confusing statement. ROCrates have been recognised as an implementation approach for the FDO concept as proposed by the FDO Forum. For more discussion on FDO and the Linked Data approach promoted by RO-Crates see https://arxiv.org/abs/2306.07436. However, RO-Crates are not based in the FDO - they are based on the Research Object packaging work that emerged from the EU Wf4ever project, (see https://doi.org/10.1016/j.future.2011.08.004 from 2013).- It is better to describe the RO-Crate metadata file as " It contains all contextual and non-contextual related data to re-run the workflow". Instead of "It can additionally contain data on which the workflow can be run."Workflow Implementation- At the beginning of the last paragraph, "Besides the workflow and the synthetic data set" replace with "As well as the workflow and the synthetic data set".- https://workflowhub.eu/workflows/402?version=5# gives a very nice pictorial overview of the workflow that you may consider including in the paper itself.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.06.07.543986v1
www.biorxiv.org www.biorxiv.org

Machine Learning Made Easy (MLme): A Comprehensive Toolkit for Machine Learning-Driven Data Analysis

5
1. GigaScience 11 Feb 2024
  
  in GigaScience
  
  Background Machine learning (ML) has emerged as a vital asset for researchers to analyze and extract valuable information from complex datasets. However, developing an effective and robust ML pipeline can present a real challenge, demanding considerable time and effort, thereby impeding research progress. Existing tools in this landscape require a profound understanding of ML principles and programming skills. Furthermore, users are required to engage in the comprehensive configuration of their ML pipeline to obtain optimal performance.Results To address these challenges, we have developed a novel tool called Machine Learning Made Easy (MLme) that streamlines the use of ML in research, specifically focusing on classification problems at present. By integrating four essential functionalities, namely Data Exploration, AutoML, CustomML, and Visualization, MLme fulfills the diverse requirements of researchers while eliminating the need for extensive coding efforts. To demonstrate the applicability of MLme, we conducted rigorous testing on six distinct datasets, each presenting unique characteristics and challenges. Our results consistently showed promising performance across different datasets, reaffirming the versatility and effectiveness of the tool. Additionally, by utilizing MLme’s feature selection functionality, we successfully identified significant markers for CD8+ naive (BACH2), CD16+ (CD16), and CD14+ (VCAN) cell populations.Conclusion MLme serves as a valuable resource for leveraging machine learning (ML) to facilitate insightful data analysis and enhance research outcomes, while alleviating concerns related to complex coding scripts. The source code and a detailed tutorial for MLme are available at https://github.com/FunctionalUrology/MLme.
  
  Reviewer 2 Ryan J. Urbanowicz Revision 2
  
  At this point I earnestly wish to see this paper published, and in acknowledging my own potential bias as a developer of STREAMLINE and participant in the development of TPOT, I am still recommending minor revision. At minimum, for me to recommend acceptance of this paper the following small but critical issue needs to be addressed, otherwise I must recommend reject. I believe this concern is well justified by scientific standards. I also still strongly recommend the authors re-consider the other non-critical issues reiterated below as a way to make their paper stronger and to be better received by the scientific community. If the journal editor disagrees with my assessment, I would still be happy to see this work published, however I must stand by my assertions below.Critical Issue:Limitations section: The authors updated the text - "excells in it's core objective of addressing classification tasks." To "it excels in its primary objective of addressing pipeline development for classification tasks.The use of the word 'excells' is the key problem, as this word is defined as "to do or be better than others". While the change in phrasing correctly no longer implies that MLme performed better than the other evaluated AutoML tools, it does still imply that it is the best in developing a pipeline for classification tasks, but no specific evidence is provided in the paper to support this assertion. I.e. there were no studies comparing how easy the tool was for users to apply than other autoML, and no detailed comparison of what pipeline elements could be included by MLme vs other autoML or pipeline development tools. The fact that MLme doesn't include hyperparameter optimization is in itself a limitation that I think would prevent MLme from being claimed as excelling or superior in pipeline development to other tools/platforms, even if it's easier to use that other tools. As phrased in the reviewer response, the authors could say that MLme is well-equipped to handle pipeline development as this would be a fair statement. All together I'd strongly encourage the authors not to make statements about the superior aspects of MLme without clearly backing up these statements with direct comparisons. Instead I'd suggest highlighting elements of MLme that are 'unique' or provide more functionality in contrast with other tools. In the reviewer response the authors make the claim that MLme is superior in terms of ease of use for visualization and exploratory analysis. If they want to make that statement in the paper backed up by accurate comparisons to other tools, I'd agree with that addition.Non-Critical Issues that I feel still should be addressed:1. Table S1 has been updated to remove the inaccuracies I previously pointed out, however this alone does not change the broader concern I had regarding the intention of this table (which is to highlight the parts of MLme that appear better than other AutoML tools without fairly pointing out the limitations of MLme in contrast with other tools). As a supplemental materials table, I do not feel this is critical, but I think making a table that more fairly reflects strengths and limitations of different tools would greatly strengthen this paper.2. The pipeline design in Figure 2 and and S10 are both high-level and still do not provide enough detail/clarity to understand exactly what happens and in what order when applying the autoML element of MLme. They key words here being transparency and reproducibility. The supplemental materials could describe a detailed walk through of what the autoML does at each step. At minimum this could also be clearly addressed in the software documentation on GitHub.3. While I understand the need for brevity, I think the addition of a sentence that indicates specifically what AutoML tools are most similar to MLme is a reasonable request that better places MLme in the context of the greater AutoML research space.
2. GigaScience 11 Feb 2024
  
  in GigaScience
  
  Background Machine learning (ML) has emerged as a vital asset for researchers to analyze and extract valuable information from complex datasets. However, developing an effective and robust ML pipeline can present a real challenge, demanding considerable time and effort, thereby impeding research progress. Existing tools in this landscape require a profound understanding of ML principles and programming skills. Furthermore, users are required to engage in the comprehensive configuration of their ML pipeline to obtain optimal performance.Results To address these challenges, we have developed a novel tool called Machine Learning Made Easy (MLme) that streamlines the use of ML in research, specifically focusing on classification problems at present. By integrating four essential functionalities, namely Data Exploration, AutoML, CustomML, and Visualization, MLme fulfills the diverse requirements of researchers while eliminating the need for extensive coding efforts. To demonstrate the applicability of MLme, we conducted rigorous testing on six distinct datasets, each presenting unique characteristics and challenges. Our results consistently showed promising performance across different datasets, reaffirming the versatility and effectiveness of the tool. Additionally, by utilizing MLme’s feature selection functionality, we successfully identified significant markers for CD8+ naive (BACH2), CD16+ (CD16), and CD14+ (VCAN) cell populations.Conclusion MLme serves as a valuable resource for leveraging machine learning (ML) to facilitate insightful data analysis and enhance research outcomes, while alleviating concerns related to complex coding scripts. The source code and a detailed tutorial for MLme are available at https://github.com/FunctionalUrology/MLme.
  
  **Reviewer 2 Ryan J. Urbanowicz ** Revision 1
  
  Overall I think the authors have made some good improvements to this paper, although it does not seem like the main body of the paper has changed much with most of the updates going into supplemental materials. However, I think this work is worthy of publication once the following items are addressed. (which I still feel strongly should be addressed, but should be fairly easy to do so).
  
  Limitations section: While the authors added some basic comparisons to a few other AutoML tools, I do not see how they are justified in saying that MLme 'excells' in it's core objective of addressing classification tasks. This implies it is better performing a classification than other methods, which is not at all backed up here, and indeed would be very difficult to prove as it would require a huge amount of analyes over a broad range of simulated and real world benchmark datasets, and incomparison to many or all orther other autoML tools. At best i think the authors can say here that it is at least comparable in performance to AutoML tools (X, Y, Z) in its ability to conduct classification analyses. And according to Figure S9 this is only across 7 datasets, and focused only on the F1 score which could also be missleading or cherry picked. At best I believe the authors can say in the paper that "Initial evaluation across 7 datasets suggested that MLMe performed comparably to TPOT and Hyperopt-sklearn with respect to F1 score performance. This suggests that MLme is effective as an automated ML tool for classification tasks. " (or something similar).
  
  While the authors lengthened the supplemental materials table comparing ML algorithms (mainly by adding some other autoML tools, this table is intentionally presenting the capabilities of tools in a way that make it appear like MLme does the most (with the exception of the 'features' column) . For example, what about a column to indicate if an autoML tool has an automated pipeline discovery component (like TPOT)? In terms of AutoML, this table is structured to highlight the benefits of MLme, rather than give a fair comparison of AutoML tools (which is my major concern here). In terms of AutoML performance and usability there is alot more to these different tools than the 6 columns presented. In this table 'features' seems like an afterthought, but is arguably the most important aspect of an AutoML.
  
  Additionally, the information presented in the autoML comparison table does not seem to be entirely accurate, or at least how the columns are defined is not made entirely clear. Looking at STREAMLINE, which can be run by users with no coding experience (as a google colab notebook), it has a code free option (just not a GUI), STREAMLINE also generates more than two exploratory analysis plots, and more results visualizations plots than indicated). While I agree that MLme has many more ease of use functionality in comparison to STREAMLINE (which is a very nice plus), a reader might look at this table and think they need to know how to code in order to use STREAMLINE, which is not the case. Could the authors at least define their criteria for the "code free" column. As it's presented now it seems to be the same exact criteria as for GUI (in which case this is redundant). The same is true for the legend for the table where '*' indicates that coding experience is required for designing a custom pipeline. This requires more clarification, as STREAMLINE can be customized easily without coding experience by simply changing options in the Google Colab notebook, and TPOT automatically discovers new analysis pipelines which isn't reflected at all.
  
  While I appreciate the authors adding a citation for STREAMLINE and some other autoML tools not previously cited, it would be nice for the authors to discuss other AutoML tools further in their main paper, as well as to acknowledge in the main paper which AutoML tools are most similar to MLme in overall design and capabilities. Based on my own review of AutoML tools the most similar tools would include STREAMLINE and MLIJAR-supervised.
  
  I like the addition of Figure S10 that more clearly lays out the elements included in MLme, but I still think the paper and documentation lacks a clear and transparent walk through of exactly what happens to the data and how the analyses are conducted from start to finish when using the AutoML (at least by default). This is important to trusting what happens under the hood for reporting results, etc.
  
  Other comments responding to author responses: * I still disagree with the authors that a dataset with up to 1500 samples or up to 5520 features could be considered large by today's standards across all research domains. Even within biomedical data, datasets up to 100K subjects are becoming common, and 'omics' datasets regularly reach hundreds of thousands to multiple millions of features. I am glad to see the authors adding a larger dataset, but i would still be cautions when making suggestions about how well MLme handles 'large' datasets without including specifics for context. However ultimately this is subjective, and not preventing me from endorsing publication. * I also disagree that MLme isn't introducing a new methodology. The steps comprising an AutoML tool can be considered in itself a new methodology, even if it is built on established components, because there are still innumerable ways to put a machine learning analysis pipeline together that adds bias, data leakage, or just yields poorer performance. Thus I also don't think it's fair to just 'assume' your method will work as well as other AutoML tools, especially when you've ran it on a limited number of datasets/problems.
3. GigaScience 11 Feb 2024
  
  in GigaScience
  
  Background Machine learning (ML) has emerged as a vital asset for researchers to analyze and extract valuable information from complex datasets. However, developing an effective and robust ML pipeline can present a real challenge, demanding considerable time and effort, thereby impeding research progress. Existing tools in this landscape require a profound understanding of ML principles and programming skills. Furthermore, users are required to engage in the comprehensive configuration of their ML pipeline to obtain optimal performance.Results To address these challenges, we have developed a novel tool called Machine Learning Made Easy (MLme) that streamlines the use of ML in research, specifically focusing on classification problems at present. By integrating four essential functionalities, namely Data Exploration, AutoML, CustomML, and Visualization, MLme fulfills the diverse requirements of researchers while eliminating the need for extensive coding efforts. To demonstrate the applicability of MLme, we conducted rigorous testing on six distinct datasets, each presenting unique characteristics and challenges. Our results consistently showed promising performance across different datasets, reaffirming the versatility and effectiveness of the tool. Additionally, by utilizing MLme’s feature selection functionality, we successfully identified significant markers for CD8+ naive (BACH2), CD16+ (CD16), and CD14+ (VCAN) cell populations.Conclusion MLme serves as a valuable resource for leveraging machine learning (ML) to facilitate insightful data analysis and enhance research outcomes, while alleviating concerns related to complex coding scripts. The source code and a detailed tutorial for MLme are available at https://github.com/FunctionalUrology/MLme.
  
  Reviewer 1 Joe Greener Revision 1
  
  The authors have adequately addressed my concerns and I believe that the manuscript is ready for publication.
4. GigaScience 11 Feb 2024
  
  in GigaScience
  
  Background Machine learning (ML) has emerged as a vital asset for researchers to analyze and extract valuable information from complex datasets. However, developing an effective and robust ML pipeline can present a real challenge, demanding considerable time and effort, thereby impeding research progress. Existing tools in this landscape require a profound understanding of ML principles and programming skills. Furthermore, users are required to engage in the comprehensive configuration of their ML pipeline to obtain optimal performance.Results To address these challenges, we have developed a novel tool called Machine Learning Made Easy (MLme) that streamlines the use of ML in research, specifically focusing on classification problems at present. By integrating four essential functionalities, namely Data Exploration, AutoML, CustomML, and Visualization, MLme fulfills the diverse requirements of researchers while eliminating the need for extensive coding efforts. To demonstrate the applicability of MLme, we conducted rigorous testing on six distinct datasets, each presenting unique characteristics and challenges. Our results consistently showed promising performance across different datasets, reaffirming the versatility and effectiveness of the tool. Additionally, by utilizing MLme’s feature selection functionality, we successfully identified significant markers for CD8+ naive (BACH2), CD16+ (CD16), and CD14+ (VCAN) cell populations.Conclusion MLme serves as a valuable resource for leveraging machine learning (ML) to facilitate insightful data analysis and enhance research outcomes, while alleviating concerns related to complex coding scripts. The source code and a detailed tutorial for MLme are available at https://github.com/FunctionalUrology/MLme.
  
  ** Reviewer 2 Ryan J. Urbanowicz ** Original Submission
  
  In this paper the authors introduce MLme, a comprehensive toolkit for machine-learning driven analysis. The authors discuss the benefits and limitations of their toolkit and provide a demonstration evaluation on 6 datasets suggesting it's potential value. Overall MLme seems like a nice, easy to use tool with a good deal of potential and value. However as the developer of STREAMLINE, an AutoML toolkit with a number of very similar goals and design architecture to MLme it was very surprising to have it not referenced or compared to in this paper. My major concerns involve the limited details about what this specifically does/includes (e.g. what 16 ML algorithms are built in), as well as what seems like a limited and largely biased comparison of this toolkit's capabilities to other autoML tools (most specifically STREAMLINE which has a significant degree of similarity).- There are many other autoML tools out there that that authors have not considered in their Table S1 or referenced. Eg. MLBox, AutoWeka, H20, Devol, Auto-Keras, TransmorgriffAI, and most glaringly in for this reviewer, STREAMLINE (https://github.com/UrbsLab/STREAMLINE).- In particular, with respect to STREAMLINE (https://link.springer.com/chapter/10.1007/978-981-19-8460-0_9), there are a large number of pipeline similarities and a similar analysis mission/goals to MLme that make it extremely relevant to cite and contrast to in this manuscript as well as in Table S1. STREAMLINE has a similar focus on the end-to-end ML analysis pipeline including automated exploratory analysis, data processing, feature selection, ML modeling with 16 algorithms, evaluation, and results visualization generation, interactive visualizations, pickled output storage, etc. The first STREAMLINE paper was published March of 2023, and a preprint of that manuscript published June 2022, as well as a precursor implementation of this pipeline published as a preprint in Aug of 2020 (https://arxiv.org/abs/2008.12829). This in contrast with MLme's preprint published July of 2023. While MLme has a number of potentially nice features that STREAMLINE does not (i.e. a GUI interface, spider plots, easy color palate selection, inclusion of a dummy classifier, ability to handle multi-class classification [which is not yet available, but in development for STREAMLINE along with regression]), it lacks other potentially important features that STREAMLINE does have (i.e. automated hyperparameter optimization, basic data cleaning and feature engineering [in the newest release], collective feature selection, pickled models for later reuse, collective feature importance visualizations, a pdf analysis summary report, the ability to quickly evaluate models on new replication data, and potentially other capabilities that I can't highlight because of limited details on what MLme includes). The absence of hyperparameter optimization is a particularly problematic omission from MLme, as this a fairly critical element of any machine learning analysis pipeline.-Table S1 should be expanded to highlight a broader range of toolkit features to better highlight the strengths and weaknesses of a greater variety of methodologies. The present table seems a bit cherry picked to make MLme stand out as appearing to have more capabilities than other tools, but there are uncaptured advantages to these other approaches.-This manuscript includes no citations justifying their pipeline design choices. In particular, I'm most concerned with the author's justification of automatically including data resampling by default as it is well known that this can introduce bias in modeling. It's also not clear what determines if data resampling is required, and whether this only impacts training data or also testing data.- Its not clear that resampling is a good/reliable strategy for an automated machine learning framework since data resampling to create a more balanced dataset can also incorporate bias in to an ML model.- In the context of potential datasets from different domains (including biomedical data), the datasets identified in this paper as being "large" have only up to 1500 sample and only up to 5520 features, which would not be considered large by most data scientist standards.- There are largely limited details in this paper and the software's github documentation in terms of transparently indicating exactly what this pipeline does, and what options, algorithms, evaluation metrics, and visualizations it includes.- Since the authors do not benchmark MLme against any other autoML tool and they have a very limited set of benchmarked datasets (6 total, with limited diversity of data types, sizes, feature types), I don't think it's fair to claim that their methodology necessarily excels in it's core objective of addressing classification tasks. Ideally the authors would conduct benchmarking comparisons to STREAMLINE, as well as other autoML toolkits, however this might also understandably be outside the scope of this current paper. I do suggest the authors be more conservative in what assertions they make and conclusions they draw with respect to MLme. The authors might consider using established ML or AutoML benchmark benchmark datasets used by other algorithms and frameworks to compare or facilitate comparison of their pipeline toolkit to others.
5. GigaScience 11 Feb 2024
  
  in GigaScience
  
  Background Machine learning (ML) has emerged as a vital asset for researchers to analyze and extract valuable information from complex datasets. However, developing an effective and robust ML pipeline can present a real challenge, demanding considerable time and effort, thereby impeding research progress. Existing tools in this landscape require a profound understanding of ML principles and programming skills. Furthermore, users are required to engage in the comprehensive configuration of their ML pipeline to obtain optimal performance.Results To address these challenges, we have developed a novel tool called Machine Learning Made Easy (MLme) that streamlines the use of ML in research, specifically focusing on classification problems at present. By integrating four essential functionalities, namely Data Exploration, AutoML, CustomML, and Visualization, MLme fulfills the diverse requirements of researchers while eliminating the need for extensive coding efforts. To demonstrate the applicability of MLme, we conducted rigorous testing on six distinct datasets, each presenting unique characteristics and challenges. Our results consistently showed promising performance across different datasets, reaffirming the versatility and effectiveness of the tool. Additionally, by utilizing MLme’s feature selection functionality, we successfully identified significant markers for CD8+ naive (BACH2), CD16+ (CD16), and CD14+ (VCAN) cell populations.Conclusion MLme serves as a valuable resource for leveraging machine learning (ML) to facilitate insightful data analysis and enhance research outcomes, while alleviating concerns related to complex coding scripts. The source code and a detailed tutorial for MLme are available at https://github.com/FunctionalUrology/MLme.
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad111), and has published the reviews under the same license. These are as follows.
  
  ** Reviewer 1 Joe Greener ** Original Submission
  
  Akshay et al. present MLme, a toolkit for exploring data and automatically running machine learning (ML) models. This software could be useful for those with less experience in ML. I believe it is suitable for publication provided the following points are addressed.# Major1. The performance of models is consistently over 90% but without a reference point it is unclear how good this is. Are there results from previous studies on the same data that can be compared to, with a table comparing accuracy with MLme to previous work? Otherwise it is unclear whether MLme is supposed to be a quick way to have a first go at prediction on the data or can entirely replace manual model refinement.2. With any automated ML system it is important to impress upon users the risks of ML. For example, the splitting of data into training and test sets is done randomly, but there are cases where this is not appropriate as it will lead to data leakage between the training and test sets. This could be mentioned in the manuscript and somewhere on the GUI. There isn't really a replacement for domain knowledge, and users of MLme should have this in mind when using the software.# Minor3. More experienced ML users may want to use the software to have a first go at prediction on the data. For these users it may be useful to provide access to commands or scripts, or at least information on which functions were used, as additional options in the GUI. Users could then run these scripts themselves to tweak hyperparameters etc.4. The visualisation tab lacks an info button by the file upload to say what the file format should be.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.07.04.546825v1
www.biorxiv.org www.biorxiv.org

Vulture: Cloud-enabled scalable mining of microbial reads in public scRNA-seq data

4
1. GigaScience 11 Feb 2024
  
  in GigaScience
  
  The rapidly growing collection of public single-cell sequencing data have become a valuable resource for molecular, cellular and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. In our scalability benchmarking experiments, Vulture can outperform the state-of-the-art cloud-based pipeline Cumulus with a 40% and 80% reduction of runtime and cost, respectively. Furthermore, Vulture is 2-10 times faster than PathogenTrack and Venus, while generating comparable results. We applied Vulture to two COVID-19, three hepatocellular carcinoma (HCC), and two gastric cancer human patient cohorts with public sequencing reads data from scRNA-seq experiments and discovered cell-type specific enrichment of SARS-CoV2, hepatitis B virus (HBV), and H. pylori positive cells, respectively. In the HCC analysis, all cohorts showed hepatocyte-only enrichment of HBV, with cell subtype-associated HBV enrichment based on inferred copy number variations. In summary, Vulture presents a scalable and economical framework to mine unknown host-microbial interactions from large-scale public scRNA-seq data. Vulture is available via an open-source license at https://github.com/holab-hku/Vulture.
  
  ** Reviewer 1 Liuyang Zhao ** R1 version
  
  The manuscript presented by the authors provides a useful tool on the microbiome, which named "Vulture: Cloud-enabled scalable mining of microbial reads in public scRNA-seq data", using a large and valuable dataset. The study is important in deepening our understanding of "microbiome in public data". However, the author comments not fully address my concerned, there are some issues for improvement in the manuscript. Here are the requirements for new software that is good enough to be published: 1. A docker provided is better, however, most used install method conda is still missing. 2. The more microbial detect example is missing. Can you provide an example of using like Kraken2 full NCBI database (RefSeq) to check all the microbial is more useful. 3. Author still not promotion his software in social media. If no more people take part in use it, how can we know it's useful? The reviewers still have may work to do. Not have enough time to test this software. Just promote it in twitter and Chinese WeChat will help software better. 4. The software name should be unique, which is convenient to count the real users through all available resources (such as QIIME, ImageGP, and EasyAmplicon). However, the name vulture is unacceptable, due to millions of hits in Google scholar. Must be no hit is a unique nameï¼ŒOK? Otherwise, hardly to know the read number of users. 5. The source code to support the generation of individual figures in this paper will be available on the GigaDB after being published. Where to check by the reviewers?
2. GigaScience 11 Feb 2024
  
  in GigaScience
  
  The rapidly growing collection of public single-cell sequencing data have become a valuable resource for molecular, cellular and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. In our scalability benchmarking experiments, Vulture can outperform the state-of-the-art cloud-based pipeline Cumulus with a 40% and 80% reduction of runtime and cost, respectively. Furthermore, Vulture is 2-10 times faster than PathogenTrack and Venus, while generating comparable results. We applied Vulture to two COVID-19, three hepatocellular carcinoma (HCC), and two gastric cancer human patient cohorts with public sequencing reads data from scRNA-seq experiments and discovered cell-type specific enrichment of SARS-CoV2, hepatitis B virus (HBV), and H. pylori positive cells, respectively. In the HCC analysis, all cohorts showed hepatocyte-only enrichment of HBV, with cell subtype-associated HBV enrichment based on inferred copy number variations. In summary, Vulture presents a scalable and economical framework to mine unknown host-microbial interactions from large-scale public scRNA-seq data. Vulture is available via an open-source license at https://github.com/holab-hku/Vulture.
  
  ** Reviewer 3 Liuyang Zhao ** Original submission
  
  The authors aim to develop Cloud-enabled approaches for detecting viral reads in public single-cell RNA sequencing (scRNA-seq) data. This study makes a significant contribution to the identification of viruses and bacteria in public scRNA-seq data. Although the outcomes are satisfactory, the novelty of the proposed methods is limited. To date, no evidence has been provided to demonstrate their superiority over recently published methods (such as PathogenTrack and Venus, et al) when executed on a local machine. There are also several issues that need to be further addressed, as highlighted below: 1.The documentation available on the GitHub pipeline does not explain how to utilize the latest virus database or how to incorporate a user's custom database. Because the virus database is updated very quickly now. It might be more appropriate if the author updates the database promptly or if one can customize and create their own database. 2. Figure 2a only has an overall comparison graph, it can be improved by adding detailed comparison graphs with Cumulus, PathogenTrack and Venus. 3. Figure 2b. The persuasiveness is not enough, it would be better to compare several pipeline platforms with similar functionalities or compare some specific steps, such as the four steps in figure 2a. By the way, all of these comparisons use comparison software developed by other same researchers, so please provide a detailed description of why the author's method is faster? 4. Figure 3c can be created with microbial clustering and non-microbial clustering to highlight the impact of virus identification on classification results. 5. Fig. S1 It should be the "Quality control on read level".
3. GigaScience 11 Feb 2024
  
  in GigaScience
  
  The rapidly growing collection of public single-cell sequencing data have become a valuable resource for molecular, cellular and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. In our scalability benchmarking experiments, Vulture can outperform the state-of-the-art cloud-based pipeline Cumulus with a 40% and 80% reduction of runtime and cost, respectively. Furthermore, Vulture is 2-10 times faster than PathogenTrack and Venus, while generating comparable results. We applied Vulture to two COVID-19, three hepatocellular carcinoma (HCC), and two gastric cancer human patient cohorts with public sequencing reads data from scRNA-seq experiments and discovered cell-type specific enrichment of SARS-CoV2, hepatitis B virus (HBV), and H. pylori positive cells, respectively. In the HCC analysis, all cohorts showed hepatocyte-only enrichment of HBV, with cell subtype-associated HBV enrichment based on inferred copy number variations. In summary, Vulture presents a scalable and economical framework to mine unknown host-microbial interactions from large-scale public scRNA-seq data. Vulture is available via an open-source license at https://github.com/holab-hku/Vulture.
  
  ** Reviewer 2 Jingzhe Jiang ** Original submission
  
  In this study, Chen et al. introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. And they further applied Vulture to COVID-19, HCC, and gastric cancer human patient cohorts with public sequencing reads dataand discovered cell-type specific enrichment of SARSCoV2, hepatitis B virus (HBV), and H. pylori positive cells. Generally speaking, this study is innovative, has good application potential, and can better assist the work of single cell research from the point of view of infection. I only a few minor questions that need the author to reply: 1. Background: The first appearance of H. pylori should be replaced with its full name. 2. Methods-Downstream analysis of scRNA-seq samples: Why use different tools (SCANPY/Seurat, BBKNN/Harmony) to analyze different datasets instead of using the same tool to analyze different datasets? 3. Cell-type enrichment of microbial UMI: format error of formula. 4. Analyses-Page 11: "The statistical test identified that SARS-CoV-2 is enriched (p-value < 0.05) in epithelial cells, neutrophils, and plasma B cells (Fig. 3d and Table. 2)". It is best to highlight p < 0.05 data points in other colors rather than red squares. Why are there no p < 0.05 square in fig. 3e? 5. Fig. 2a and 2b: There are 8 colors in figure 2a, however only 4 figure legend were showed. What do the four light-colored bar mean? And the same to Fig 2b.
4. GigaScience 11 Feb 2024
  
  in GigaScience
  
  The rapidly growing collection of public single-cell sequencing data have become a valuable resource for molecular, cellular and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. In our scalability benchmarking experiments, Vulture can outperform the state-of-the-art cloud-based pipeline Cumulus with a 40% and 80% reduction of runtime and cost, respectively. Furthermore, Vulture is 2-10 times faster than PathogenTrack and Venus, while generating comparable results. We applied Vulture to two COVID-19, three hepatocellular carcinoma (HCC), and two gastric cancer human patient cohorts with public sequencing reads data from scRNA-seq experiments and discovered cell-type specific enrichment of SARS-CoV2, hepatitis B virus (HBV), and H. pylori positive cells, respectively. In the HCC analysis, all cohorts showed hepatocyte-only enrichment of HBV, with cell subtype-associated HBV enrichment based on inferred copy number variations. In summary, Vulture presents a scalable and economical framework to mine unknown host-microbial interactions from large-scale public scRNA-seq data. Vulture is available via an open-source license at https://github.com/holab-hku/Vulture.
  
  This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad117), and has published the reviews under the same license. These are as follows.
  
  ** Reviewer 1 Yongxin Liu** Original submission
  
  The manuscript presented by the authors provides a useful tool on the virome, which named "Vulture: Cloud-enabled scalable mining of viral reads in public scRNA-seq data", using a large and valuable dataset. The study is important in deepening our understanding of "virome in public data". However, there are some issues for improvement in the manuscript. Here are the requirements for new software that is good enough to be published: Major comments: 1. The software, tested data and results are required to be uploaded on GitHub for peers to use, and conda and/or docker installation modes are recommended for software with complex dependencies. We will take software Star, Fork, and downloads of GitHub as one of the audience indicators. I found the GitHub links: https://github.com/holab-hku/Vulture. However, the readme.md show pipeline on AWS cloud. If I not have an AWS, how can I run it in my server. Now this project is only 2 stars. You need more people to take part in and interest in this project. 2. Software installation and User tutorial are required in Readme.md or Wiki in GitHub. Please provide step by step protocol to deploy it in the laptop or server. 3. A video of software download, installation, operation, and result display is required with a computer or server without any related software installed, to make sure that any new user can perform the whole process according to the tutorial. 4. The software is required to be posted on twitter and other social media, you can contact @ iMetaScience, @microbe_article etc. to get help in tweet or retweet. The number of Retweet, Like and View as one of the audience indicators. 5. Chinese is largest single langue science society. Provide the Chinese tutorial and video presentation of the software, contact meta-genome Official account for help to promote. The Number of readers, share and favorite also one of the audience indicators. 6. According to the feedback from users in all over the world, the author continuously maintains and optimizes the method to ensure its availability, ease of use and advancement. 7. The software name should be unique, which is convenient to count the real users through all available resources (such as QIIME, ImageGP, and EasyAmplicon). However, the name vulture is unacceptable, due to million of hits in Google scholar. 8. The figures in your papers are diversity. However, I cannot find enough visualization function in your pipeline. The pipeline for integrated software is easy, the specific and diversity visualization plan is difficult. All the authors want their analysis result is ready-to-published. 9. Why only focus on the virus? Can this pipeline to generated all the microbiome, which is more interest and overview of the microbes.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.02.13.528411v1

GigaScience

Annotations: 876

Joined: September 13, 2019

Annotators

URL

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Tags

Annotators