Jul 2026
www.biorxiv.org

Breaking the cold chain: solutions for room temperature preservation of mosquitoes leading to high quality reference genomes

2
1. GigaScience 10 Jul 2026
  
  in GigaScience
  
  Abstract: The Earth BioGenome Project (EBP) is a global endeavour to produce reference genomes for all described eukaryotic species. The majority of described species are arthropods, which tend to be small and require taxonomic expertise to identify to species level. Therefore, the ability to collect and preserve specimens in a suitable way for long read and Hi-C data generation using very simple approaches with minimal infrastructure is certain to be important in scaling up reference genome generation. Using Anopheles mosquitoes as an insect representative we evaluate how well different preservation liquids protect high molecular weight DNA, RNA, and nuclei for Hi-C when mosquitoes are held intact versus slightly squished. We find that squished samples stored in 100% ethanol and Allprotect held at room temperature for one week result in excellent preservation of both high molecular weight DNA and nuclei for Hi-C. Other tested buffers, including RNAlater, EDTA at several pHs, and DMSO Salt Solution (DESS) performed satisfactorily for long read data generation and RNA retrieval, but less ideally for Hi-C, which may have bigger negative impacts when aiming to generate data for organisms with larger genomes. Field collections requiring dry ice or dry shippers can be logistically challenging to arrange, are notoriously expensive, and DNA degrades rapidly if ultra-cold temperature is not maintained, which is devastating given how expensive and time consuming field work can be. Here we present multiple viable options for room temperature collection and/or shipment for arthropod samples. Further exploration across a broader range of species will hopefully enable cheaper and more widely available reference genome generation globally.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag061), which carries out single-anonymized peer review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2:
  
  This study uses Anopheles coluzzii as a model to systematically evaluate preservation methods for generating high-quality genomic and Hi-C data, addressing a question of clear practical relevance. The development of approaches that eliminate the need for cold-chain transport is a particularly important step forward for field-based genomics, where maintaining low temperatures is often logistically difficult. In this context, the work provides useful experimental evidence and practical guidance with broad potential value. That said, several aspects of the manuscript would benefit from further clarification and refinement, as outlined below:
  
  1. The study is conducted exclusively on Anopheles coluzzii under controlled laboratory conditions. While the results are promising, the extent to which these preservation strategies can be applied to other arthropods remains unclear. Given the substantial diversity in cuticle structure, body size, and physiological properties across taxa, the authors should more explicitly discuss the limitations of extrapolating their findings. Inclusion of additional taxa would be ideal; alternatively, a more thorough discussion of potential constraints would improve the manuscript.
  
  2. The evaluation of Hi-C data quality is primarily based on scaffolding performance, including analyses using both high-quality and fragmented assemblies. However, the Hi-C datasets were generated at very high sequencing depth (>100×), which may obscure differences among preservation treatments by compensating for variations in data quality. The authors should clarify how they distinguish genuine preservation effects from depth-related compensation. Additional analyses based on downsampled Hi-C datasets would provide a more realistic assessment of performance under typical sequencing conditions.
  
  3. I am particularly interested in the chromosome anchoring rates observed in the Hi-C scaffolding analyses. It would be valuable for the authors to clarify whether Hi-C data generated from different preservation methods lead to differences in chromosome anchoring efficiency, as this is a key indicator of scaffolding quality.
  
  4. The manuscript indicates that sequencing data are not yet publicly available due to their inclusion in a larger ENA project. In line with journal policies, the authors should provide a clear plan for data release, including expected timelines and accession numbers where possible. Furthermore, additional methodological details, particularly regarding genome assembly parameters and Hi-C data processing, would improve reproducibility and transparency.
  
  5. In Table 1, the inclusion of ULI scaffolding appears somewhat abrupt, as ULI sequencing is not sufficiently introduced or contextualized prior to its presentation in the table. Although its role becomes clearer in the Results section, readers may find it difficult to fully understand its relevance at this stage. The authors may consider introducing ULI sequencing earlier in the Methods or Results, or providing a brief explanation in the table caption, to improve clarity and ensure a more coherent presentation.
  
  This manuscript presents a systematic evaluation of preservation methods for generating high-quality genomic and Hi-C data using Anopheles coluzzii as a model organism. The study addresses an important practical challenge in genomics, particularly for field-based sample collection where optimal preservation conditions are often difficult to achieve. The experimental design is generally well-structured, and the comparison across multiple preservation treatments provides useful insights for the community. However, several aspects of the study require further clarification and improvement. In particular, concerns remain regarding the generalizability of the findings beyond the focal species, the robustness of the statistical analyses, and the interpretation of Hi-C results under very high sequencing coverage. Additionally, issues related to data availability and methodological transparency should be addressed to ensure reproducibility. Addressing these points would substantially strengthen the manuscript.
  
  1. The study is conducted exclusively on Anopheles coluzzii under controlled laboratory conditions. While the results are promising, the applicability of these preservation strategies to other arthropods remains unclear. Given the diversity in cuticle structure, body size, and physiology across taxa, the authors should clarify the extent to which their findings can be generalized. Inclusion of additional taxa or a more explicit discussion of limitations would strengthen the manuscript.
  
  2. The evaluation of Hi-C data quality is based on scaffolding performance, including analyses using both high-quality and fragmented assemblies. However, the Hi-C datasets were generated at very high coverage (>100×), which may mask differences in preservation efficiency. The authors should clarify how they distinguish true preservation effects from sequencing depth-related compensation. Additional analyses using downsampled Hi-C data would provide a more realistic assessment of performance under typical conditions.
  
  3. I am particularly interested in the chromosome anchoring rate of the genome assemblies in the Hi-C scaffolding analysis. It would be valuable for the authors to clarify whether Hi-C data generated using different preservation methods result in differences in chromosome anchoring efficiency.
  
  5. The column headers “Self scaffolding” and “ULI scaffolding” in Table 1 are not sufficiently clear, making it difficult for readers to fully understand the intended meaning.
  
  The manuscript indicates that sequencing data are not yet publicly available due to their inclusion in a larger ENA project. In line with journal policies, the authors should provide a clear plan for data release, including expected timelines and accession numbers if available. Furthermore, greater detail in the Methods section—particularly regarding assembly parameters and Hi-C processing—would improve reproducibility.
2. GigaScience 10 Jul 2026
  
  Reviewer 1:
  
  This is a straightforward and valuable study. Demonstrating that breaking the cold chain is feasible has strong potential to transform how genomes can be obtained across much of biodiversity. In my view, the broader significance of this advance is somewhat underemphasized in the Introduction (there is really just a short mention around lines 72-75). Based on firsthand experience in large field campaigns designed specifically to generate genomes from very small insects, I can attest that much of arthropod biodiversity (especially within small-bodied, hyperdiverse "dark taxa") stands to benefit substantially from approaches like this. The practical importance of scalable, field-friendly preservation solutions cannot be overstated.
  
  Overall, the study is well designed, clearly presented, and addresses a real logistical bottleneck in biodiversity genomics. The conclusions are generally well supported by the data presented. I have a few suggestions that I believe would further strengthen the manuscript and increase its practical value and reproducibility.
  
  Major / Substantive Comments
  
  Framing and impact
  
  As above, I recommend strengthening the Introduction's framing of downstream biodiversity applications, especially for small-bodied and taxonomically challenging groups where cold-chain logistics are often the primary limiting factor. The method has particularly high relevance for large-scale efforts targeting hyperdiverse insect groups ("dark taxa"), and this applied significance could be more explicitly highlighted to broaden the paper's audience and appeal.
  
  Practical protocol clarity and reproducibility
  
  Because this method is likely to be adopted by field teams, including non-specialists, it would be very helpful to include a concise protocol-style summary or workflow outlining the recommended handling steps, specimen size considerations, and timing constraints. A short practical guide or decision framework would improve reproducibility and uptake.
  
  Relatedly, laying out any known tolerance ranges in this fashion would strengthen the paper, for example, how sensitive outcomes are to delays in processing, temperature variation, or specimen size differences under field conditions.
  
  Preservation benchmarking
  
  I suggest including a summary comparison (table or figure) of yield/quality metrics across preservation treatments relative to standard cold-chain approaches. This would make performance differences easier for readers to interpret and apply.
  
  Voucher integrity and downstream taxonomic usability
  
  Given that many target organisms will come from taxonomically difficult groups, a short discussion of voucher integrity after treatment would be valuable. Guidance on expected morphological preservation and suitability for downstream taxonomic work would significantly increase the method's usefulness for specimen-based biodiversity genomics workflows.
  
  Methodological Clarification
  
  The instruction to "lightly squish" specimens to compromise the cuticle is practical, but could benefit from additional detail. Does the location of compression affect outcomes? For example, is thoracic compression preferable (to access muscle tissue), or is abdominal disruption sufficient? This may seem like a fine detail, but it matters in practice, particularly because damage to thoracic characters or terminalia can reduce taxonomic value. If there is an optimal or recommended compression location, it would be useful to specify it.
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.07.03.662936v1
www.biorxiv.org

Cell type transcriptomics reveal shared genetic mechanisms in Alzheimer’s and Parkinson’s disease

2
1. GigaScience 10 Jul 2026
  
  in GigaScience
  
  Abstract: Historically, Alzheimer’s disease (AD) and Parkinson’s disease (PD) have been investigated as two distinct disorders of the brain. However, a few similarities in neuropathology and clinical symptoms have been documented over the years. Traditional single gene-centric genetic studies, including GWAS and differential gene expression analyses, have struggled to unravel the molecular links between AD and PD. To address this, we tailor a pattern-learning framework to analyze synchronous gene co-expression at sub-cell-type resolution. Utilizing recently published single-nucleus AD (70,634 nuclei) and PD (340,902 nuclei) datasets from postmortem human brains, we systematically extract and juxtapose disease-critical gene modules. Our findings reveal extensive molecular similarities between AD and PD gene cliques. In neurons, disrupted cytoskeletal dynamics and mitochondrial stress highlight convergence in key processes; glial modules share roles in T-cell activation, myelin synthesis, and synapse pruning. This multi-module sub-cell-type approach offers insights into the molecular basis of shared neuropathology in AD and PD.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag059), which carries out single-anonymized peer review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2:
  
  This study presents an analysis of gene co-expression modules in Alzheimer's disease (AD) and Parkinson's disease (PD), utilizing a pattern learning framework predicated on single-nucleus RNA sequencing data. The research uncovered shared molecular mechanisms between the two diseases within neurons, microglia, oligodendrocytes, and astrocytes. These include cytoskeletal dynamics, mitochondrial stress responses, T cell activation processes, dysregulated myelin synthesis pathways, and abnormal heavy metal handling mechanisms. These findings illustrate common genetic underpinnings of AD and PD in specific cellular contexts, thereby providing novel insights into their pathological overlaps.
  
  1. It is recommended that the author allocate a specific amount of space in the introduction section to elucidating the relationship between neurological disorders and other types of diseases.
  
  2. Detailed validation results using independent AD and PD datasets should be presented in the paper to enhance the credibility of the study.
  
  3. The Methods section should provide a more comprehensive description of the statistical methods used for batch effect correction and control of potential confounding factors.
  
  4. To strengthen the biological relevance of the findings, functional experiments such as in vitro cell assays or animal models should be performed to validate the roles of identified key gene modules in the pathogenesis of AD and PD.
  
  5. The discussion section should further elaborate on the clinical implications of these shared molecular mechanisms, including their potential as therapeutic targets or biomarkers.
2. GigaScience 10 Jul 2026
  
  Reviewer 1:
  
  The authors utilized single-nucleus RNA-seq data and GWAS data to study the disease comorbidity between Alzheimer's disease and Parkinson's disease. They also validated their discoveries using independent evaluation cohorts. This is an overall well-conducted and comprehensive study with a solid logical framework and clear presentation of very interesting biological results. I have a few points that need clarification.
  
  1. In the section 'Common Mechanisms of microglia's involvement in AD and PD', lines 790-792, the authors mention that 'we noted T cell activation in both AD and PD (all 4 datasets).' This is a very interesting biological discovery. To my knowledge, the number of T cells recovered from brain snRNA-seq is quite small (I could be wrong). Can the authors provide evidence that this conclusion is not biased by the small T cell population?
  
  2. The authors mention using the PLS-DA method to extract disease modules for the AD and PD datasets, respectively. If I understand correctly, the authors apply PLS-DA directly to the snRNA-seq data rather than to, for example, pseudobulk profiles aggregated per cell type. Can the authors explain, and add the rationale to the Methods section, why applying PLS-DA to snRNA-seq data makes sense? As we know, snRNA-seq data are quite sparse.
  
  3. In Figure 4D, the authors present multiple gene modules. Can the authors also add gene/protein symbols, not just circles, so that readers can see directly which proteins are involved?
  
  4. In the Methods section on 'Differential gene expression', the authors do not appear to mention covariate adjustment when computing DEGs. Did the authors consider any covariates for the DEG analysis, for example sex or PMI?
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.02.17.638647v1
www.biorxiv.org

4D Single-Cell Spatial Transcriptomics Reveals Dynamic Morphogenetic Gradients and Regenerative Domains in Planarians

3
1. GigaScience 10 Jul 2026
  
  in GigaScience
  
  Abstract: Regeneration relies on precise spatiotemporal gene expression and cellular responses to establish tissue identity and body patterning. Using high-resolution Stereo-seq (715 nm) on 353 sections from 16 whole animals at 8 regeneration timepoints, we constructed a 4D spatiotemporal transcriptomic map of planarian regeneration. Our analysis captured 36 refined cell types from 3,508,004 segmented cells, enabling genome-wide transcriptional imputation of gene expression dynamics across body axes at cellular, tissue, and organismal scales. We identified dynamic positional gradients and distinct spatially distributed cell types during regeneration, including an injury-induced Anterior Regenerative Zone (ARZ). The ARZ exhibited enriched positional signals in epidermal, muscle, and neural cells and was regulated by Mediator 8, which is crucial for polarity remodeling and blastema formation. This study provides a comprehensive spatial molecular and cellular map of regenerative processes, highlighting injury-induced spatial domains and key regulatory factors in planarian regeneration. We also provide an interactive web portal, offering a valuable resource for exploring and analyzing regeneration mechanisms in a spatiotemporal context.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag064), which carries out single-anonymized peer review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3:
  
  In the manuscript entitled "4D Single-Cell Spatial Transcriptomics Reveals Dynamic Morphogenetic Gradients and Regenerative Domains in Planarians", Han, Chen, Li et al. use high-resolution Stereo-seq on regenerating planarians to reconstruct a 4D spatiotemporal transcriptomic map of planarian regeneration. In their analysis, they recognize most of the cell types identified in planarians and are able to recover the gene expression dynamics during regeneration of the body axes at the cellular, tissue, and organismal scales. One of the main findings is the identification of injury-induced spatial domains, specifically the Anterior Regenerative Zone (ARZ). Interestingly, the ARZ is enriched in positional control gene expression in several cell types, not only in muscle. The authors identify Mediator 8 as a gene expressed in the ARZ and required for proper blastema formation. The study also provides an interactive web portal with the corresponding data. The analysis of the regeneration process in planarians using Stereo-seq provides a new and very useful strategy to understand the dynamics of gene expression integrated with cell and tissue types. Through this strategy, the authors corroborate the expression patterns of several genes already described as essential for planarian regeneration, and they identify new blastema regions comprising different expression patterns, both anterior and posterior. Among them, the study focuses on the ARZ and performs trajectory analysis, which provides a very informative view of the cellular movements and changes that occur at early stages of regeneration. The finding that Med8 is required to initiate regeneration in both wounds validates the utility of the strategy followed. The publication of an open and interactive web portal with the dataset will be a useful tool for the planarian community and for research on regenerative processes in general. 
However, in its present form, the study presents several weaknesses and issues that should be addressed before publication.
  
  Main concerns:
  
  The 3 main concerns are: 1) the strategy behind each analysis is not presented in enough detail and is not clear enough; 2) data essential to the present manuscript are said to be found in Han et al., submitted, when they should be available in the present manuscript; and 3) the conclusions from the functional analysis of med8 are not accurate.
  
  Regarding concern 1:
  
  - The authors explain that they analyzed 16 animals (2 per time point) processed into 10 μm thick sections, producing a total of 353 sections. However, they do not specify whether the same number of sections were analyzed per animal, nor do they indicate the total size (or thickness) and cell number of the animals analyzed. In this regard, it may be that the number of sections analyzed per animal is shown in Figure S2A (this is not clear from the figure legend or the text). If so, why is there variation in the number of sections per animal? Is it due to differences in animal size? If so, how were these differences addressed in order to integrate the data from different animals?
  - In Figure 1F, the different parts of the blastema are divided according to the pigmented/unpigmented area. What were the criteria used to divide each blastema into three parts (proximal, middle, and distal)? Gene expression? Length of the region? This should be clarified.
  - A schematic overview of each strategy used to obtain the results would help the reader understand the procedures.
  
  Regarding concern 2:
  
  - Throughout the study, the authors refer to Han et al., submitted for the custom framework used to create the atlas. Since that study is not currently available, this important information is missing from the current manuscript.
  - Similarly, from the section "Spatiotemporal dynamics of positional gradients during whole-body regeneration" onward, the study relies on the so-called SBGs (spatially biased genes), also described in Han et al., submitted, which is not published. This is very important data that, if not published elsewhere, must be included in the current manuscript. A description of how the SBGs were obtained, together with a list and explanation, is required. Otherwise, essential information about the SBGs, on which the subsequent analyses depend, is entirely missing.
  - According to the authors, PCGs are a subset of SBGs. What is the essential difference between PCGs and SBGs?
  
  Regarding concern 3:
  
  - Functional analysis of med8. The authors refer to a role of this gene in polarity remodeling. First, this is an unclear concept, because polarity establishment and polarity remodeling are two different processes during regeneration. Second, the RNAi results presented do not support a role for med8 in polarity. The phenotype suggests that med8 is required for cell-fate specification during early stages of regeneration, since neoblasts increase but differentiated cells decrease in general. However, the results do not support a role for med8 in pole formation. The authors report a decrease in sfrp-1, but only at later stages, and no data are shown for notum. The conclusion that med8 regulates blastema growth and positional information is therefore not accurate and does not align with the results presented.
  
  Additional main concerns:
  
  - A spatiotemporal transcriptomic atlas of planarian regeneration was already published in Cui et al. 2023. Although the authors cite it in the introduction, they should also compare their results with those published previously. Do they observe similar cell types and gene expression dynamics at the time points analyzed? This comparison should appear in both the Results and Discussion sections.
  - The authors state that the different clusters of SBGs fluctuate during regeneration and then stabilize. First, which genes belong to each cluster? This information should be included. Second, according to Figure 2A, some clusters fluctuate (e.g., A1 and A2) but others do not (e.g., A5 or M9). A more accurate interpretation is required. Furthermore, the starting point of the analysis is t0, when head and tail are missing, and the endpoint is 14 days of regeneration. How can one assess whether gene expression stabilizes if the starting and ending states are completely different? Importantly, in Figure 2A, the cluster labels in the image do not correspond to the bars, and there are 15 bars but 16 clusters. The figure should be corrected. The colormap labels are also missing.
  - The authors find that predictions from the Gierer-Meinhardt model are consistent with the expression of some genes (ARNT, Ndk, EGR1…). First, a description and reference for these genes are required. Second, what about other genes involved in Wnt signaling and AP patterning, previously proposed to follow this model in Werner et al. 2015? Third, how can transcription factors such as Hox4b follow the model if they are not secreted?
  - In general, the manuscript does not refer to specific PCGs known to be critical for regeneration and patterning, such as notum and wnt1. Is this because they could not be identified in the dataset? If so, is it not possible to perform FISH and integrate these results with the Stereo-seq data?
  
  Additional issues to be addressed:
  
  Lines 161-163: Why do the authors conclude that the increase in dorsal epidermal progenitors, neural progenitors, and pharyngeal lineages indicates an active wound response or the generation of new tissue? This statement is not clear.
  
  The 4D atlas annotates 36 refined cell types, which is a noteworthy result when compared with published scRNA-seq databases. The authors should compare their results with scRNA-seq in terms of the number and identity of cell types.
  
  In Figure 1F, the blastema is divided according to the pigmented/unpigmented region. However, in Figure 1G the dividing lines do not follow this curvature. Should they not follow the pigment curvature as well? Additionally, why do the blastemas in Figure 1G show such pronounced lateral deviation? Is this because the incisions were not perpendicular to the AP axis? If so, with this variability, it is difficult to understand how the datasets could be integrated. A detailed explanation is required.
  
  Lines 202-203: "To test this, we analyzed the spatiotemporal patterns of SBGs by mapping representative samples from each time point onto clusters along the A/P axis using logistic regression (Fig. 2A)." What is meant by "representative samples"? The procedure should be specified more clearly.
  
  Data S4 is not mentioned in the text.
  
  smed03831, caveolin3, and smed01640 are the genes enriched in the ARZ domain. However, the authors perform functional analysis only with med8. Is there any specific reason for this?
2. GigaScience 10 Jul 2026
  
  Reviewer 2：
  
  In the manuscript '4D single-cell spatial transcriptomics reveals dynamic morphogenetic gradients and regenerative domains in planarians,' Han and colleagues generate a truly stunning spatial transcriptomics dataset of planarian regeneration from the species Schmidtea mediterranea. The authors' dataset includes whole 3D reconstructions of two regenerating planarian fragments at 8 different timepoints during regeneration, a fantastic accomplishment and a resource of broad interest to the regenerative biology community. The authors' analysis of the dataset includes characterization of spatially biased genes (SBGs) and exploration of an anterior regenerative zone (ARZ) and the role of the gene med8 in its regulation. While the authors' dataset is remarkable and their analysis of spatially biased genes and med8 function is interesting, I'm not yet convinced that their conclusions are fully tested by the included experiments. In addition, I think that the authors have not included sufficient quality control metrics for their spatial dataset, which makes it more difficult to determine the limitations or caveats of their analysis and conclusions. However, my concerns could be addressed by additional analysis and minor experiments, or by softening the authors' conclusions to include alternative models. I've detailed the areas of analysis/discussion that I believe require improvement below:
  
  Major Criticisms: 1. Stereo-seq resolution and capture efficiency: The authors assert that their spatial approach has high enough resolution to resolve cell types, and they claim in their abstract to have characterized 36 cell types. However, the 'cell type' in their dataset that they choose to focus on - Clu.31 - has gene markers expressed in three different cell types that have been shown to be distinct in the literature and prior planarian atlases. The authors should analyze gene expression signatures of other Stereo-seq 'cell types' to determine if they also show mixed expression signatures. In addition, I am curious whether Stereo-seq is more likely to capture highly expressed genes (like those expressed in parenchymal cell types) than more lowly expressed genes (like the transcription factors expressed in stem cells). If it exists, this bias could influence annotation of cell types in highly heterogeneous regions of the worm like the parenchyma or parapharyngeal region. Finally, there is very little QC data in the supplementary materials (size/volume of segmented cells, UMIs and features per cell, variability in features/UMIs per section, per replicate, and per cell type, etc.). I think this analysis would be highly valuable for the reader to interpret the data and the 36 identified 'cell types'.
  
  2. Dynamics of spatially biased genes: The authors' analysis of the dynamics of spatially biased genes (SBGs) is very interesting, but the 'oscillations' the authors refer to were not clear to me in the data across all or even most of the pattern clusters in Figure 2A. In general, it seemed more like the pattern cluster was 'noisy' or broader before stabilizing to its final location. In addition, the PCA analysis in Figure 2B seems to show that the Intact and 14dpa transcriptomes are very similar, but the 0h, 12h, and 36h timepoints are very distinct from the 3, 5, 7, and 10 day fragments. This would suggest that early wound response gene expression is highly distinct from (even opposing) the gene expression programs active late in regeneration. More exploration of this idea, as well as clarified language on exactly what the authors mean by 'oscillations' and which gene groups follow this pattern, would greatly improve this section and better support the authors' conclusions.
  
  3. The Cellular/Functional identity of Clu.31: The authors state throughout the manuscript that Clu.31 (the ARZ) is an injury-induced anterior state enriched for SBGs and regulating polarity establishment. However, it is also possible that this spatial state represents the anterior peripheral nervous system (numerous sensory neurons and surface epithelial cells that help sense mechanical and chemical cues). SBGs could be enriched because this combination of cell types is only present in the anterior of the animal. Indeed, the authors show that the ARZ is localized to the anterior in intact animals in the absence of an injury (Figure 3), and enriched genes (S4Aii) strongly indicate that Clu.31 contains gabrg+ mechanosensory neurons. If Clu.31 is regenerating nervous system, this would also explain its ventral bias and expression of tgs-1 and other nb2 genes, since nb2 neoblasts have been suggested to be both an amputation-responsive neoblast subset (Zeng et al. Cell) and a neural progenitor state (Raz et al. Cell Stem Cell). Clarifying how the composition of the tri-lineage region changes during regeneration may help distinguish whether Clu.31 is truly an injury-induced region or the regenerating sensory nervous system. For example, it is known that agat-1+ cells are transcriptionally responsive and enriched at the wound site at 2-4 days post amputation, but less so at later timepoints (Benham-Pyle et al. Nature Cell Biology, Kent et al. Developmental Biology). This shift in composition should be observable in Clu.31 since it contains agat+ epidermal cells. Such a shift in composition, or the identification of a regeneration-specific marker expressed in Clu.31, would add support to the authors' conclusions. Regardless of the outcome of these experiments/analyses, the discussion and interpretation of the data could be modified to address the hypothesis that Clu.31 represents the cellular neighborhood created when the peripheral nervous system intercalates with the anterior DV boundary epithelium and body wall muscle, which needs to be regenerated in amputated worms. As is, the comparison to the apical epithelial cap considered in the discussion (Line 438) may be premature.
  
  4. Med8 function: Med8 RNAi produces a clear phenotype in the authors' experiments, and their data indicate that it is required for ARZ formation. However, I am not sure that the authors' data support the claim that Med8 is directly regulating blastema and PCG expression, as opposed to regeneration of the nervous system (which is highly interconnected with formation of the anterior pole and the size of the anterior blastema) and stem cell function more broadly. The fact that Med8 RNAi also leads to head degeneration in intact worms (Figure S6F) strongly suggests a more fundamental defect in neural differentiation or stem cell function. The strongest evidence presented by the authors supporting a broader function in polarity establishment is the disruption of posterior Wnt expression (Figure 5F and G), but these in situs are single representative images with no quantitation and could also be explained by a stem cell defect. Additional data could be provided (e.g. visualization of wound-induced gene expression, quantitation of anterior or posterior stem cell numbers and proliferation rates at 2dpa) to support regulation of PCGs or blastema formation. The authors could also leverage their single-cell sequencing to determine if Med8 RNAi impacts neural progenitor abundance more than other progenitor cell types. Together, these experiments would determine whether Med8 is important for amputation-induced blastema formation and polarity re-establishment or for stem cell function and neural differentiation more broadly.
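  The per-cell QC summaries requested under Major Criticism 1 (UMIs and features per cell, and their variability per section) could be tabulated along the following lines. This is only a minimal sketch: the long-format record layout `(section_id, cell_id, gene, umi_count)` and both function names are assumptions made for illustration and are not taken from the manuscript or from any Stereo-seq toolkit.

```python
from collections import defaultdict
from statistics import median


def per_cell_qc(records):
    """Summarise UMIs and detected features per segmented cell.

    records: iterable of (section_id, cell_id, gene, umi_count) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    umis = defaultdict(int)
    genes = defaultdict(set)
    for section, cell, gene, count in records:
        if count > 0:
            umis[(section, cell)] += count
            genes[(section, cell)].add(gene)
    return {
        key: {"n_umis": umis[key], "n_features": len(genes[key])}
        for key in umis
    }


def per_section_medians(cell_qc):
    """Collapse per-cell QC into per-section cell counts and medians."""
    by_section = defaultdict(list)
    for (section, _cell), stats in cell_qc.items():
        by_section[section].append(stats)
    return {
        section: {
            "n_cells": len(stats),
            "median_umis": median(s["n_umis"] for s in stats),
            "median_features": median(s["n_features"] for s in stats),
        }
        for section, stats in by_section.items()
    }
```

  The same grouping could be repeated per replicate or per annotated cell type by changing the grouping key.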
  
  Minor Criticism/Feedback:
  1. In Figure 1I, the authors show DEGs enriched in each cluster/region. In the blastema regions, I was surprised by the number of DEGs for each time point. It appears that there are ~10K upregulated and 10K downregulated DEGs by the later time points, which suggests that 2/3 of the transcriptome is differentially expressed. The authors should clarify in the text or methods what cutoff they used for the DEGs and how significant the DEGs are in this figure.
  2. For readability, I really think that all figures should be on a white background.
  3. How do gene expression profiles from the Stereo-seq data compare to bulk RNA-seq at similar timepoints?
  4. It is very interesting that there are some cell types that appear to contract and then expand during regeneration (Cluster 0, 23) or that aggregate/become more targeted during regeneration (pharynx pouch, cluster 29). Molecular differences between early and late cells within these cell types would be particularly interesting for understanding different phases of regeneration, but this may be beyond the scope of the current study.
  5. The authors frequently reference Han et al. submitted, but this manuscript would need to be pre-printed or published in order for this work to reference it.
  6. The Y axis of Figure 2E should be labeled.
3. GigaScience 10 Jul 2026
  
  in GigaScience
  
  Reviewer 1：
  
  The authors employ high-resolution Stereo-seq technology combined with multi-timepoint spatial transcriptomic data to construct a 4D spatiotemporal transcriptomic map of planarian regeneration. This work significantly advances the understanding of spatial gene expression dynamics during planarian regeneration, overcoming the limitations of traditional two-dimensional and planar spatial transcriptomics. Furthermore, the authors identify a novel injury-induced Anterior Regenerative Zone (ARZ) and, through functional validation of Mediator 8 (Med8), deepen insights into the mechanisms underlying polarity remodeling in planarians. The study also provides an interactive online database, enriching spatial molecular and cellular data resources for regenerative biology. The work is notably innovative, and the authors present convincing evidence supporting their conclusions. The manuscript overall is well written and the data are presented clearly. The discussion and conclusions do well to highlight the potential problems in this study. I have a few points that should be addressed before publishing.
  
  1. In lines 233-236, it is reported that a positional control gene (PCG) such as ndk restores its spatial expression pattern as early as 12 hpa, whereas its expression level only increases significantly at 36 hpa. Given this pronounced temporal discordance between early recovery of spatial patterning and the later peak in mRNA levels, the authors should analyze and discuss possible molecular mechanisms that could account for this discrepancy, and consider the biological implications of this phenomenon for understanding how spatial information and gene-expression regulation are coordinated during regeneration.
  
  2. In lines 356-360, Med8 knockdown markedly reduces the ARZ cell lineages and the expression of anterior-posterior polarity markers (e.g., sfrp-1, wnt1, wnt11-1), producing a clear effect on regeneration polarity formation. However, no gross disruption of the whole-body AP axis was observed. Please further analyze and discuss the possible regulatory scope and mechanisms of Med8. Specifically, do other redundant pathways or compensatory mechanisms exist in planarians that maintain global positional information despite loss of Med8? What is the hierarchical and cell-type specificity of Med8's role in polarity regulation?

Annotators

GigaScience

URL

biorxiv.org/content/10.64898/2026.02.18.706529v1
www.biorxiv.org www.biorxiv.org

PathoFact 2.0: An Integrative Pipeline for Predicting Antimicrobial Resistance Genes, Virulence Factors, Toxins and Biosynthetic Gene Clusters in Metagenomes

3
1. GigaScience 10 Jul 2026
  
  in GigaScience
  
  Abstract (Summary): Antimicrobial resistance genes (ARGs) and virulence factors (VFs) are central contributors to the global health crisis surrounding drug-resistant infections. PathoFact, a bioinformatics pipeline introduced in 2021, provides insights into ARGs, VFs, and bacterial toxins from metagenomic data. However, recent advancements in bioinformatics highlight the need for an updated version of PathoFact. We introduce PathoFact 2.0, an enhanced pipeline for improved ARG, VF, and toxin prediction. Key updates include an updated machine learning (ML) model for VF identification, a new ML model for toxin identification, expanded hidden Markov model profiles, and the antiSMASH 7.0 integration for predicting biosynthetic gene clusters. These upgrades make PathoFact 2.0 a more powerful, user-friendly platform for predicting microbiome-based pathogenicity and resistance, offering a crucial tool for better understanding and addressing the challenges posed by antimicrobial resistance and infectious diseases.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag062), which carries out single-anonymized peer review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3：
  
  Several methods are available to predict ARGs, VFs, toxins, and biosynthetic gene clusters. However, the authors selected only a few tools to benchmark PathoFact 2.0, which I find to be a weakness of the manuscript. To be useful to the scientific community, a more rigorous performance evaluation is needed. It is not fully clear how the "false" sequences were chosen. Ideally, they should be similar to known resistance genes but should not confer resistance. Details of the parameters used to create the HMM models are not mentioned in the manuscript. The performance of the updated HMMs in comparison to the older version is not shown. It would be interesting to show how updates in DeepARG, RGI, and AMRFinderPlus have improved the performance of PathoFact 2.0 over version 1.0. I believe the non-pathogenic dataset was constructed using sequences other than those mentioned in the section "Generalities about Machine learning training set-up and 'non-pathogenic". This means that sequences that do not contain the mentioned keyword were used as the negative dataset. These sequences include housekeeping genes, which are also too distant from ARGs, VFs, etc. The real test of an ML model occurs with data from the grey zone, which has properties of both negative and positive examples. The authors could benchmark the ML model on such grey-zone data to show its efficiency.
  
  Based on the above-mentioned points, I recommend major revision of the manuscript.
2. GigaScience 10 Jul 2026
  
  in GigaScience
  
  Reviewer 2：
  
  The authors present the pipeline PathoFact2.0, which combines external modules and machine learning algorithms in order to find genes that provoke antimicrobial resistance, virulence and toxicity. They present their work, an improvement on the previous version, as a Technical Note. From what the authors say, it is the only pipeline available with those characteristics, which clearly makes it a relevant piece of software and a relevant article. However, I believe the article requires refinement, as well as new tests that support the authors' claims.
  
  Refinements on the article: In the abstract, ARG is used as an abbreviation for Antimicrobial Resistance, not Antimicrobial Resistance Genes. Overall, the article should be clearer about what the pipeline is made for. It mentions fungal, viral, etc. sequences (which might be found in a metagenomic sample, of course), but to my understanding, all the tools and phenotypes searched for are mostly characteristic of bacteria. While the introduction offers a good summary of the genes of interest, some descriptions are not particularly accurate. "Human, animal, and environmental microbiomes harbour commensal and pathogenic microorganisms, contributing to the emergence of infectious diseases" seems to say that commensal microorganisms contribute to the emergence of infectious diseases; "ARGs are genetic elements that confer bacterial resistance to antibiotics, acquired via mutations or horizontal gene transfer." seems to say that antimicrobial resistance genes are acquired via mutation (they are not; there is a difference between resistance to antimicrobials provoked by mutations and by genes). I recommend a thorough rewriting of the first 6 paragraphs. The graphs in Figures 1 and 2 have different Y axes, which are also not shown. This is, to put it mildly, very misleading. Table 1 would be much clearer as a Figure. Table 1 cited in line 211 does not exist. A short description of "dereplication" would help users lacking that knowledge. The description of the parameters of the machine learning modules is a copy-paste of the variables used by scikit-learn (lines 248 and 268). The description of the machine learning models should be clearer and more detailed, and should not force the reader to consult the scikit-learn documentation to check the meaning of those parameters. The authors analyze the performance of the models depending on the "probability" (a term that could definitely use a better introduction) using barplots. A standard way to analyze the probability output of a model is a ROC curve.
  
  Improvements: The authors' claim that the machine learning models for VF and toxin prediction are more accurate than similarity models is, to my understanding, not proved in the article. If the comparison is only with PathoFact, which was created with a dataset made years ago, the higher performance could easily be due to more complete datasets. A fair comparison of improved performance should be done with the same dataset (PathoFact but with the dataset collected for PathoFact2.0). Moreover, the results only show that PathoFact2 predicts more toxins and virulence factors than PathoFact. The creation of the dataset, for training and most importantly for testing, is rather unclear and described all over the article. I recommend creating a section for it to better explain the filtering, maybe with a figure (which could go in the supplementary material, if necessary), and including the amount of data in each test set. The authors seem to have put a lot of effort into the testing sets (including trying to avoid testing with the same data that the models are trained with), but it gets diluted in the article and, as a consequence, the test results are difficult to evaluate. The results against VirulentHunter are impressive, outperforming a fine-tuned language model. While I do not doubt that the authors are thorough in their methods, such claims require more testing. Testing using external databases (not created by the authors, maybe the same used by VirulentHunter or other models validated experimentally such as pLM4VF) would support such claims. The comparison across different bacterial strains raises more questions than answers. Are all those VFs and toxins found in E. coli experimentally validated? How much overlap is there between pathogenic and non-pathogenic E. coli? Are they all of the same type?
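  The ROC analysis suggested above can be sketched in a few lines. This is a minimal pure-Python illustration rather than the authors' evaluation code: the function names `roc_points` and `auc` are hypothetical, and in practice a library routine (for example from scikit-learn, which the manuscript already uses) would normally be preferred.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs for a threshold sweep.

    labels: 1 for positive, 0 for negative examples.
    scores: predicted probabilities for the positive class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

  Unlike barplots at a few probability cutoffs, the curve shows the trade-off between true and false positives at every threshold, and the area under it gives a single threshold-independent summary.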
  
  Overall, there is a good amount of work in this project, but the article still leaves a lot of questions unanswered. The strengths of PathoFact2, as well as its weaknesses (which any model has), remain somewhat unclear. A strength could be its speed, or the fact that it bundles plenty of tools in one pipeline. I would also appreciate a better description of the report that PathoFact2.0 produces. If its strength is the virulence and toxin prediction, more tests must be performed (as described above). This would be very beneficial for possible users of the model. Moreover, on a more technical note, I recommend that the authors add a test sample to their repository for easy testing of the model.
3. GigaScience 10 Jul 2026
  
  in GigaScience
  
  Reviewer 1：
  
  The pipeline should be very useful for shotgun metagenomics data analysis. Aside from the ARGs and VFs, the features for signal peptides, toxin predictions, and BGCs, in particular for specialised metabolite predictions, are welcome for detailed analysis and understanding of various transmission mechanisms. I find the approaches very appealing and I think the pipeline could be welcomed by the community.
  
  I only have some minor observations:
  - The Methods section should be placed next to the described methods. As it is, at the end of the manuscript, under the Methods chapter you can only find Datasets, so proper formatting of the Methods is required.
  - There is a buffer of 70 blank pages between the References and the supplementary data.
  - Could you add some future prospects to the manuscript? How well is it going to be maintained? I noticed the updates are quite old.
  - I would also add some more details to the limitations. For instance, it is clear that the pipeline is installable on Linux platforms, but did you consider making it available for the Apple silicon series as well? More and more researchers use this technology, and it works as well as the Linux distributions. I also tried an install on an M-series Apple silicon machine, but unfortunately most of the tools in the pipeline led to multiple errors related to Python versions (most of which are old), missing old dependency versions, libraries, etc.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.12.09.627531v1
www.biorxiv.org www.biorxiv.org

FEDRANN: effective overlap graph construction based on dimensionality reduction and approximate nearest neighbors

2
1. GigaScience 10 Jul 2026
  
  in GigaScience
  
  Abstract: Overlap detection is a key step in de novo genome assembly pipelines based on the Overlap-Layout-Consensus (OLC) paradigm. However, existing methods for overlap detection either rely on heuristic seed-and-extension strategies or locality-sensitive hashing (LSH), both of which struggle to handle repetitive genomic regions and the computational burden of large-scale datasets. Here, we present FEDRANN, a novel strategy for overlap graph construction that integrates feature extraction, dimensionality reduction (DR), and approximate nearest neighbor (ANN) search. We find the pipeline combining inverse document frequency (IDF) transformation, sparse random projection (SRP), and NNDescent enables accurate detection of overlaps across diverse datasets. We developed an efficient open-source implementation of this pipeline named Fedrann (https://github.com/jzhang-dev/fedrann). Through systematic benchmarking on real long-read sequencing data, we demonstrate that Fedrann produces overlap graphs comparable to or better than those generated by existing state-of-the-art tools, including MECAT2, minimap2, and wtdbg2, while maintaining competitive runtime. Despite being implemented primarily in Python, Fedrann achieves performance on par with tools written in compiled languages, owing to matrix-based representations and C-accelerated numerical libraries. Our results suggest that DR and ANN techniques offer a promising new direction for scalable and accurate overlap detection in long-read assembly and broader sequence similarity search tasks.
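  The three stages named in the abstract (feature extraction with IDF weighting, sparse random projection, nearest-neighbour search) can be illustrated with a toy example. This is a sketch of the general idea only and not Fedrann's implementation: the use of k-mer presence as features is an assumption, every function name is hypothetical, the projection uses plain +1/-1 entries without scaling (cosine similarity is unaffected by a constant scale), and an exact brute-force search stands in for NNDescent.

```python
import math
import random
from collections import Counter


def kmer_set(read, k):
    """All distinct k-mers of a read (assumed feature type)."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def idf_vectors(reads, k):
    """Binary k-mer presence per read, weighted by inverse document frequency."""
    sets = [kmer_set(r, k) for r in reads]
    df = Counter(kmer for s in sets for kmer in s)
    n = len(reads)
    idf = {kmer: math.log(n / count) for kmer, count in df.items()}
    return [{kmer: idf[kmer] for kmer in s} for s in sets], sorted(df)


def sparse_projection(vocab, dim, density, seed=0):
    """Sparse random rows: each entry is +1 or -1 with probability `density`, else 0."""
    rng = random.Random(seed)
    return {
        kmer: [(j, rng.choice((-1.0, 1.0))) for j in range(dim) if rng.random() < density]
        for kmer in vocab
    }


def project(vec, rows, dim):
    """Map a sparse k-mer vector to a dense low-dimensional embedding."""
    out = [0.0] * dim
    for kmer, weight in vec.items():
        for j, sign in rows[kmer]:
            out[j] += weight * sign
    return out


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def overlap_candidates(reads, k, dim, density, n_neighbors):
    """For each read, indices of its most similar reads (exact search, not NNDescent)."""
    vectors, vocab = idf_vectors(reads, k)
    rows = sparse_projection(vocab, dim, density)
    emb = [project(v, rows, dim) for v in vectors]
    graph = []
    for i, a in enumerate(emb):
        scored = sorted(
            ((cosine(a, b), j) for j, b in enumerate(emb) if j != i), reverse=True
        )
        graph.append([j for _, j in scored[:n_neighbors]])
    return graph
```

  The IDF step down-weights k-mers shared by many reads, which is one way repetitive sequence can be kept from dominating the similarity; the brute-force loop is quadratic in the number of reads, which is exactly the cost an approximate method is meant to avoid.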
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag048), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2：
  
  The authors present METRIN-KG, a knowledge graph integrating plant metabolomic, trait, and biotic interaction datasets. The work should have substantial value for multiple plant science and ecology domains. The overall effort to harmonize disparate resources into an integrated, semantically coherent resource is impressive. The methodology includes a notable pipeline for ontology alignment across multiple sources. However, despite its technical strengths, several issues regarding data accessibility and manuscript structure and clarity should be addressed. The manuscript is highly technical throughout. While this level of precision is exemplary, it may alienate biologically oriented readers, and effort should be made to ensure that the impact and the manuscript are clear to a larger audience.
  
  Major comments
  
  The online deployment currently presents several problems that must be resolved before publication.
  
  The existence of multiple SPARQL UIs (https://kg.earthmetabolome.org/metrin which redirects to https://qlever.earthmetabolome.org/metrin-kg/ and https://sib-swiss.github.io/sparql-editor/metrin-kg) is not explained, and the redundancy is potentially confusing.
  
  At the time of review, https://qlever.earthmetabolome.org/metrin-kg/ showed expired SSL certificates, making it inaccessible for most users. Automatic certificate renewal (e.g., using certbot) should be implemented.
  
  Attempting to use the SIB UI resulted in an error due to the redirect.
  
  The certificate issue also prevented evaluation of ExpasyGPT querying against METRIN-KG.
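  The expired-certificate problem described above could be caught early with a small monitoring check alongside automatic renewal. The following is a minimal sketch using only Python's standard `ssl` and `socket` modules; the function names and the 14-day warning window are assumptions chosen for illustration and are not part of the METRIN-KG deployment.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate expires.

    not_after: the certificate's 'notAfter' string, e.g. 'Jan  5 09:34:43 2018 GMT'.
    now: current time as a Unix timestamp (defaults to the system clock).
    """
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def needs_renewal(not_after, warn_days=14, now=None):
    """True when the certificate expires within `warn_days` (assumed window)."""
    return days_until_expiry(not_after, now) < warn_days


def fetch_not_after(host, port=443, timeout=10):
    """Read 'notAfter' from a live server certificate (requires network access).

    With the default context, an already expired certificate makes the
    handshake raise ssl.SSLCertVerificationError, which is itself a useful alarm.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

  Run on a schedule, `needs_renewal(fetch_not_after(host))` would flag a certificate before users encounter the failure.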
  
  Usability represents a significant barrier. Many potential users such as biologists without semantic-web or RDF experience are unlikely to be able to interpret the current figures or formulate SPARQL queries. The manuscript briefly mentions ExpasyGPT, which has strong potential to overcome this barrier by allowing natural language querying. This tool should be emphasized more prominently, potentially with a dedicated subsection and discussion of its role in broadening the resource's accessibility.
  
  To fully demonstrate the relevance of METRIN-KG, the use-case section would benefit from quantitative summaries, visualizations (e.g., distributions, network visualizations), and biological interpretations of query results.
  
  The overall manuscript organisation would benefit from restructuring to improve readability and better expose the impact of the work done.
  
  The methodological content currently in "Mapping of TRY data" and "Mapping of GloBI data" should be moved to the Methods section, while the quantitative outputs (e.g., numbers of records) should be moved into a dedicated Results section.
  
  The current "Data re-use and case studies" section could be reorganised into Results and Discussion sections:
  
  Results include: description of outputs from the methodological steps (e.g. the ontology, the success of the mapping process, the size of the final knowledge graph/number of triples, and other relevant metrics); the user interface, including being able to share and add example questions and write NL questions via ExpasyGPT; examples of SPARQL queries and case studies.
  
  Discussion includes: Reuse potential; case-study interpretations or impact; future directions including planned expansion and enhancing of ontological structure, etc.
  
  An overview and evaluation of the KG should be provided, for example the number of plant species and the distributions of connections. Any gaps or potential biases due to the input data need to be acknowledged. For example, it is very likely that certain species are under- or overrepresented in either the metabolome or interaction datasets, and geographic skews could exist.
  
  The ontology is variously referred to as the "Earth Metabolome Ontology", "EMI ontology", and "EMI". Consistent naming should be adopted throughout the manuscript and associated repositories. It is also unclear whether the ontology is a result of this work. As written, the "Ontology" section under "Methods" reads more like a result/description than a methodological step. Clarifying which components are original contributions and presenting them in Results would strengthen the manuscript. Additionally, the phrases "our proposed framework" and "our approach" are ambiguous: do these refer to the ontology itself, the metadata-mapping pipeline, or the overall integration process? Finally, referring to METRIN-KG as a "tutorial to build a knowledge graph" appears to be a bit out of place, given the topic of the manuscript.
  
  Given its potential relevance beyond this project, the authors are strongly encouraged to publish the code for the metadata-mapping pipeline. In addition, the following details would strengthen the methodological rigor:
  
  Were any acceptability criteria implemented in the automated step, e.g. a minimum cosine similarity threshold?
  
  Were the manual corrections systematically documented?
  
  In how many cases were manual corrections needed?
  
  Was any evaluation done on the embedding/model version or the source of errors?
  
  Minor comments
  
  The manuscript would benefit from proofreading for consistent use of Oxford commas (and an "&" instead of "and").
  
  Reference 190 contains a typo "GiHub"
  
  References should be checked (e.g. citation for [95] references both METRIN-KG Zenodo and GloBI Zenodo)
  
  The METRIN-KG Zenodo link in the article is not to the latest version (Version 5)
  
  Consider improving the figures, e.g. use of colour to better communicate the content and refining layouts.
  
  The authors should deposit a snapshot of the GitHub repository to Zenodo.
  
  The following are suggestions to the authors, to be followed by their own judgement:
  
  Table 1, Figure 3, Figure 4 could be moved to supplementary material to make space for figures for the case studies.
  
  The dense in-text list of ontologies in the "Metadata mapping" section could be replaced with a summarized table (e.g. by moving Supplementary Table 2, but including references to the main text).
  
  The full SPARQL queries in "Taxonomy mapping" could be moved to supplementary materials, with a high-level description left in the main text.
2. GigaScience 10 Jul 2026
  
  in GigaScience
  
  Abstract: Overlap detection is a key step in de novo genome assembly pipelines based on the Overlap-Layout-Consensus (OLC) paradigm. However, existing methods for overlap detection either rely on heuristic seed-and-extension strategies or locality-sensitive hashing (LSH), both of which struggle to handle repetitive genomic regions and the computational burden of large-scale datasets. Here, we present FEDRANN, a novel strategy for overlap graph construction that integrates feature extraction, dimensionality reduction (DR), and approximate nearest neighbor (ANN) search. We find the pipeline combining inverse document frequency (IDF) transformation, sparse random projection (SRP), and NNDescent enables accurate detection of overlaps across diverse datasets. We developed an efficient open-source implementation of this pipeline named Fedrann (https://github.com/jzhang-dev/fedrann). Through systematic benchmarking on real long-read sequencing data, we demonstrate that Fedrann produces overlap graphs comparable to or better than those generated by existing state-of-the-art tools, including MECAT2, minimap2, and wtdbg2, while maintaining competitive runtime. Despite being implemented primarily in Python, Fedrann achieves performance on par with tools written in compiled languages, owing to matrix-based representations and C-accelerated numerical libraries. Our results suggest that DR and ANN techniques offer a promising new direction for scalable and accurate overlap detection in long-read assembly and broader sequence similarity search tasks.
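  The stages named in this abstract (k-mer features, IDF weighting, sparse random projection, nearest-neighbour search) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not Fedrann's code: the function names, the k-mer size of 8, the 256-dimensional embedding, the particular IDF variant, and above all the exhaustive cosine search, which stands in for NNDescent. The projection is left unscaled because cosine similarity ignores scale.

```python
import math
import random
from collections import Counter

def kmer_counts(read, k=8):
    """Feature extraction: count every k-mer in a read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))

def idf_weights(profiles):
    """One common IDF variant: k-mers seen in many reads get a lower weight."""
    n = len(profiles)
    df = Counter(kmer for p in profiles for kmer in p)  # reads containing each k-mer
    return {kmer: math.log(n / d) + 1.0 for kmer, d in df.items()}

def projection_row(kmer, dim, seed=0):
    """Sparse random row for one k-mer: +1 or -1 with probability 1/6 each, else 0."""
    rng = random.Random(f"{seed}:{kmer}")  # string seed, so the row is reproducible
    row = {}
    for j in range(dim):
        u = rng.random()
        if u < 1 / 6:
            row[j] = 1.0
        elif u < 2 / 6:
            row[j] = -1.0
    return row

def embed(profile, idf, dim=256):
    """Project an IDF-weighted k-mer profile into `dim` dimensions."""
    vec = [0.0] * dim
    for kmer, count in profile.items():
        weight = count * idf[kmer]
        for j, sign in projection_row(kmer, dim).items():
            vec[j] += weight * sign
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest_neighbours(vectors, k=1):
    """Exhaustive k-NN by cosine similarity (an ANN method would approximate this)."""
    result = []
    for i, v in enumerate(vectors):
        scored = sorted(
            ((cosine(v, w), j) for j, w in enumerate(vectors) if j != i),
            reverse=True,
        )
        result.append([j for _, j in scored[:k]])
    return result

# Toy data: reads 0 and 1 overlap by 90 bases; read 3 lies inside read 2.
rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(300))
reads = [genome[0:120], genome[30:150], genome[180:300], genome[210:300]]

profiles = [kmer_counts(r) for r in reads]
idf = idf_weights(profiles)
vectors = [embed(p, idf) for p in profiles]
neighbours = nearest_neighbours(vectors, k=1)
print(neighbours)
```

  With these toy reads, each read's nearest neighbour should be the read it overlaps (0 with 1, 2 with 3). The exhaustive search compares every pair of reads, which is exactly the cost an approximate method such as NNDescent is meant to avoid on real datasets.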
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag048), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1:
  
  Summary
  
  This paper presents FEDRANN, a novel approach to overlap detection in long-read genome assembly that combines feature extraction, dimensionality reduction (DR), and approximate nearest neighbor (ANN) search. The authors systematically evaluate a range of design choices and implement the best-performing pipeline (IDF-SRP-NNDescent) as an open-source tool, Fedrann. Benchmarking against state-of-the-art tools (minimap2, MECAT2, wtdbg2, BLEND, xRead, MHAP) shows that Fedrann achieves competitive or superior accuracy and overlap graph quality across multiple sequencing platforms (ONT, PacBio HiFi, CycloneSEQ), while maintaining reasonable runtime. The conceptual framing of overlap detection as a k-NN search problem is both innovative and potentially influential.
  
  Major Strengths
  
  1. Novel conceptual framework: Framing overlap detection as a k-NN search problem, drawing analogies from single-cell analysis, is creative and opens new algorithmic possibilities for genome assembly.
  
  2. Systematic evaluation: The authors carefully assess multiple feature extraction, DR, and ANN methods before converging on the optimal pipeline.
  
  3. Strong empirical results: Fedrann demonstrates high accuracy and graph quality, often outperforming established tools while remaining runtime-efficient.
  
  Major Concerns and Recommendations
  
  1. Memory consumption is a critical limitation
  
  • Fedrann requires >700 GB RAM for human genome datasets, which makes the tool impractical for most research labs and cost-prohibitive for cloud use.
  
  • While acknowledged, this limitation is somewhat downplayed. In its current state, Fedrann may be restricted to only very high-resource environments.
  
  • Recommendation: Either (a) demonstrate initial results of memory-reduction strategies (e.g., shared memory, memory-mapped structures, GPU acceleration), or (b) more prominently highlight this as a key limitation restricting practical adoption.
  
  2. Incomplete evaluation at the assembly pipeline level
  
  • The paper evaluates overlap graph quality but does not show results on final genome assemblies.
  
  Minor Comments
  
  • Figures S4–S6 (embedding dimension analysis) could be better explained in the main text with more intuitive interpretation.
  
  • Benchmarking fairness: tools designed for high recall (e.g., minimap2, MECAT2) may be disadvantaged by post-processing into “top k” mode. Clarify this limitation in comparisons.
  
  Minor Corrections
  
  • Figure 1: Step numbering is currently (1), (3), (4). Step (2) is missing.
  
  Overall Recommendation
  
  This paper presents a novel and promising framework for overlap detection with strong methodological rigor and empirical results. However, two major gaps (excessive memory usage and lack of assembly-level validation) must be addressed before the work can be considered fully convincing. In addition, clarity on basic parameters such as k-mer size is necessary for robustness.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.05.30.656979v3
www.biorxiv.org

METRIN-KG: A knowledge graph integrating plant metabolites, traits, and biotic interactions

2
1. GigaScience 10 Jul 2026
  
  in GigaScience
  
  Abstract
  
  Background: In recent years, biodiversity data management has emerged as a critical pillar in global conservation efforts. Today, the ability to efficiently collect, structure, and analyze biodiversity data is central to breakthroughs in conservation, drug development, disease monitoring, ecological forecasting, and agri-tech innovation. However, due to the vastness and heterogeneity of biodiversity data, it is often confined to databases for specific research areas in isolated formats and disconnected from other relevant resources. Crucial components of such data in kingdom Plantae comprise of metabolomes - the vast array of compounds produced by plants; traits - measurable characteristics of plants that influence their growth, survival, and reproduction, and that affect ecosystem processes; and biotic interactions - relationships of plants with other living organisms, affecting the ecosystem functions.
  
  Results: In this work, we present METRIN-KG (MEtabolomes, TRaits, and INteractions-Knowledge Graph) a powerful data resource simplifying the integration of diverse and heterogeneous data resources such as plant metabolomes, traits, and biotic interactions.
  
  Conclusions: The proposed knowledge graph provides an interface to interactively search for data relating plant metabolomes, traits, and interactions. This, in turn, will facilitate development of research questions in life-sciences. In this context, we provide representative case studies on how to frame queries that can be used to search for relevant data in the knowledge graph.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag051), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2:
  
  The authors present METRIN-KG, a knowledge graph integrating plant metabolomic, trait, and biotic interaction datasets. The work should have substantial value for multiple plant sciences and ecology domains. The overall effort to harmonize disparate resources into an integrated, semantically coherent resource is impressive. The methodology includes a notable pipeline for ontology alignment across multiple sources. However, despite its technical strengths, several issues regarding data accessibility and manuscript structure and clarity should be addressed. The manuscript is highly technical throughout. While this level of precision is exemplary, it may alienate biologically oriented readers, and effort should be made so that the manuscript and its impact are clear to a wider audience.
  
  Major comments
  
  The online deployment currently presents several problems that must be resolved before publication.
  
  The existence of multiple SPARQL UIs (https://kg.earthmetabolome.org/metrin which redirects to https://qlever.earthmetabolome.org/metrin-kg/ and https://sib-swiss.github.io/sparql-editor/metrin-kg) is not explained, and the redundancy is potentially confusing.
  
  At the time of review, https://qlever.earthmetabolome.org/metrin-kg/ showed expired SSL certificates, making it inaccessible for most users. Automatic certificate renewal (e.g., using certbot) should be implemented.
  
  Attempting to use the SIB UI resulted in an error due to the redirect.
  
  The certificate issue also prevented evaluation of ExpasyGPT querying against METRIN-KG.
  
  Usability represents a significant barrier. Many potential users such as biologists without semantic-web or RDF experience are unlikely to be able to interpret the current figures or formulate SPARQL queries. The manuscript briefly mentions ExpasyGPT, which has strong potential to overcome this barrier by allowing natural language querying. This tool should be emphasized more prominently, potentially with a dedicated subsection and discussion of its role in broadening the resource's accessibility.
  
  To fully demonstrate the relevance of METRIN-KG, the use-case section would benefit from quantitative summaries, visualizations (e.g., distributions, network visualizations), and biological interpretations of query results.
  
  The overall manuscript organisation would benefit from restructuring to improve readability and better expose the impact of the work done.
  
  The methodological content currently in "Mapping of TRY data" and "Mapping of GloBI data" should be moved to the Methods section, while the quantitative outputs (e.g., numbers of records) should be moved into a dedicated Results section.
  
  The current "Data re-use and case studies" section could be reorganised into Results and Discussion sections,
  
  Results include: description of outputs from the methodological steps (e.g. the ontology, the success of the mapping process, the size of the final knowledge graph/number of triples, and other relevant metrics); the user interface, including being able to share and add example questions and write NL questions via ExpasyGPT; examples of SPARQL queries and case studies.
  
  Discussion includes: Reuse potential; case-study interpretations or impact; future directions including planned expansion and enhancing of ontological structure, etc.
  
  An overview and evaluation of the KG should be provided, for example the number of plant species and the distribution of connections. Any gaps or potential biases due to the input data need to be acknowledged. For example, it is very likely that certain species are under- or overrepresented in either the metabolome or interaction datasets, and geographic skews could also exist.
  
  The ontology is variously referred to as the "Earth Metabolome Ontology", "EMI ontology", and "EMI". Consistent naming should be adopted throughout the manuscript and associated repositories. It is also unclear whether the ontology is a result of this work. As written, the "Ontology" section under "Methods" reads more like a result/description than a methodological step. Clarifying what components are original contributions and presenting them in Results would strengthen the manuscript. Additionally, the phrases "our proposed framework" and "our approach" are ambiguous: do these refer to the ontology itself, the metadata-mapping pipeline, or the overall integration process? Finally, referring to METRIN-KG as a "tutorial to build a knowledge graph" appears to be a bit out of place, given the topic of the manuscript.
  
  Given its potential relevance beyond this project, the authors are strongly encouraged to publish the code for the metadata-mapping pipeline. In addition, the following details would strengthen the methodological rigor:
  
  Were any acceptability criteria implemented in the automated step, e.g. a minimum cosine similarity threshold?
  
  Were the manual corrections systematically documented?
  
  In how many cases were manual corrections needed?
  
  Was any evaluation done on the embedding/model version or the source of errors?
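  
  To make the first and third questions concrete, the sketch below shows what a minimum cosine similarity criterion might look like in an automated mapping step. It is purely illustrative and does not describe the authors' pipeline: the threshold of 0.8, the function names, the term labels, and the toy three-dimensional vectors are all assumptions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def map_terms(source_vecs, target_vecs, threshold=0.8):
    """Pick the best target per source term; split results by the threshold."""
    accepted, needs_review = {}, {}
    for term, vec in source_vecs.items():
        best, score = max(
            ((t, cosine_similarity(vec, tv)) for t, tv in target_vecs.items()),
            key=lambda pair: pair[1],
        )
        # Matches below the threshold are routed to manual curation.
        (accepted if score >= threshold else needs_review)[term] = (best, score)
    return accepted, needs_review

# Hypothetical embeddings of source metadata terms and ontology terms.
source = {"leaf area": [0.9, 0.1, 0.0], "seed mass": [0.4, 0.5, 0.6]}
target = {"leaf_area_trait": [1.0, 0.0, 0.0], "plant_height_trait": [0.0, 1.0, 0.0]}

accepted, needs_review = map_terms(source, target)
manual_rate = len(needs_review) / len(source)
print(accepted, needs_review, manual_rate)
```

  Reporting the threshold together with the share of terms sent to manual review (here `manual_rate`) would answer both questions at once.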
  
  Minor comments
  
  The manuscript would benefit from proofreading for consistent use of Oxford commas (and an "&" instead of "and").
  
  Reference 190 contains a typo "GiHub"
  
  References should be checked (e.g. citation for [95] references both METRIN-KG Zenodo and GloBI Zenodo)
  
  The METRIN-KG Zenodo link in the article is not to the latest version (Version 5)
  
  Consider improving the figures, e.g. use of colour to better communicate the content and refining layouts.
  
  The authors should deposit a snapshot of the GitHub repository to Zenodo.
  
  The following are suggestions to the authors, to be followed by their own judgement:
  
  Table 1, Figure 3, Figure 4 could be moved to supplementary material to make space for figures for the case studies.
  
  The dense in-text list of ontologies in the "Metadata mapping" section could be replaced with a summarized table (e.g. by moving Supplementary Table 2, but including references to the main text).
  
  The full SPARQL queries in "Taxonomy mapping" could be moved to supplementary materials, with a high-level description left in the main text.
2. GigaScience 10 Jul 2026
  
  in GigaScience
  
  Abstract
  
  Background: In recent years, biodiversity data management has emerged as a critical pillar in global conservation efforts. Today, the ability to efficiently collect, structure, and analyze biodiversity data is central to breakthroughs in conservation, drug development, disease monitoring, ecological forecasting, and agri-tech innovation. However, due to the vastness and heterogeneity of biodiversity data, it is often confined to databases for specific research areas in isolated formats and disconnected from other relevant resources. Crucial components of such data in kingdom Plantae comprise of metabolomes - the vast array of compounds produced by plants; traits - measurable characteristics of plants that influence their growth, survival, and reproduction, and that affect ecosystem processes; and biotic interactions - relationships of plants with other living organisms, affecting the ecosystem functions.
  
  Results: In this work, we present METRIN-KG (MEtabolomes, TRaits, and INteractions-Knowledge Graph) a powerful data resource simplifying the integration of diverse and heterogeneous data resources such as plant metabolomes, traits, and biotic interactions.
  
  Conclusions: The proposed knowledge graph provides an interface to interactively search for data relating plant metabolomes, traits, and interactions. This, in turn, will facilitate development of research questions in life-sciences. In this context, we provide representative case studies on how to frame queries that can be used to search for relevant data in the knowledge graph.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag051), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1:
  
  This study makes an important and timely contribution to plant ecology by combining multiple data sources of plant functional traits, metabolites, and interaction partners. Linking different aspects of a plant's phenotype (morphological and physiological traits, as well as metabolite profiles) with their potential ecological functions (biotic interactions) represents a much-needed step forward in both chemical and functional ecology. The effort the authors have put into compiling these datasets and establishing connections across sources is evident and deserves appreciation. The potential applications of this framework are manifold, and the examples provided on how the database can be used to explore research questions through knowledge paths convincingly demonstrate its value.
  
  The Introduction could be strengthened. At present, it feels somewhat long and diffuse in scope, which may make it harder for the reader to quickly grasp the importance of the contribution. Streamlining the text and sharpening the focus on the added value that linking different data sources provides would considerably improve clarity and help the reader more fully appreciate the strength of this novel approach.
  
  Minor comments: It would be helpful to specify more clearly what is meant by the "tremendous chemical diversity" that challenges metabolomic data analysis, and to clarify whether the "limited resources available to manage such complexity" refers to analytical limitations, data availability, or both. Similarly, the phrase "multi-level interaction data also lies locked in the guise of pairwise correlation metrics" could benefit from further explanation. As far as I am aware, the GloBI database reports individual pairwise observations, which could already be seen as an effective way of documenting such interactions.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.08.20.671289v3
academic.oup.com

Stereo-cell deciphers the spatial and functional heterogeneity of polyploid hepatocytes

1
1. GigaScience 04 Jul 2026
  
  in Gigascience Annotations
  
  Yier Cai ,
  
  Very precocious contributor from Guangdong Country Garden School, who completely coincidentally has the same family name as the very wealthy in-laws of the journal's Editor in Chief
Visit annotations in context

Annotators

GigaScience

URL

academic.oup.com/gigascience/article/doi/10.1093/gigascience/giag023/8503482
Jun 2026
www.biorxiv.org

Transcriptomic profile of embryoid bodies under hypoxia at single cell level

2
1. GigaScience 19 Jun 2026
  
  in GigaByte
  
  Editors Assessment:
  
  This is a Data Release paper describing a mouse embryoid body single-cell RNA-seq dataset generated to study how oxygen availability shapes early cell differentiation. Acosta-Iborra et al. differentiated R1 mouse embryonic stem cells into embryoid bodies for 8 or 10 days, exposing them to hypoxia or normoxia for the final 16 or 48 hours of differentiation, then profiled thousands of cells per condition using droplet-based scRNA-seq from 10X. This yielded eight raw/filtered HDF5 count matrices across the four conditions. The dataset was validated with flow cytometry, immunofluorescence, and EdU assays, confirming that hypoxia increased endothelial marker expression and vascular network complexity while inducing cell cycle arrest. This pattern was mirrored transcriptionally, with hypoxic samples showing markedly higher proportions of cells in G0/G1 phase and elevated hypoxia gene-signature scores. QC analysis (and peer review in GigaByte) confirmed high data quality across samples, and conservative low-resolution clustering revealed a largely homogeneous progenitor population with a smaller, more differentiated subset. While there are limitations (mature endothelial cells were too sparse to robustly test the original hypothesis), the authors present this as an open, well-validated resource for comparative studies of hypoxia responses, benchmarking single-cell computational tools, and investigating early lineage specification and oxygen signaling more broadly.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 19 Jun 2026
  
  in GigaByte
  
  Abstract: Oxygen availability is a key regulator of cellular physiology and hypoxia plays a central role driving vasculogenesis and angiogenesis during development. While bulk transcriptomics has revealed important oxygen-regulated gene networks, such approaches cannot resolve the cellular heterogeneity and lineage dynamics characteristic of early differentiation. To address this, we generated a single-cell transcriptomic dataset from murine embryoid bodies, a widely used in vitro model of early embryonic development, cultured 8 or 10 days under hypoxic (1% O2) or normoxic (21% O2) conditions for the final 16 or 48 hours of differentiation. This resource enables detailed exploration of how oxygen availability influences lineage specification, vascular and hematopoietic development, and cellular heterogeneity during early differentiation. Beyond developmental biology, the dataset provides a valuable reference for comparative studies of hypoxia responses, benchmarking of single-cell analysis methods, and integrative investigations into oxygen signaling across diverse biological systems.
  
  This paper is peer reviewed in GigaByte journal and the peer reviews are released under a CC-BY license. See: Bárbara Acosta-Iborra, Yosra Berrouayel, Laura Puente-Santamaría, Luis del Peso, Benilde Jiménez, Transcriptomic profile of embryoid bodies under hypoxia at single cell level, Gigabyte, 2026 https://doi.org/10.46471/gigabyte.178
  
  Reviewer 1. Gerardo Cordero
  
  Is the validation suitable for this type of data? Yes. The experimental effect was validated. Minor comments: 1) In the second line of the abstract, please change the conjunction ‘while’ to ‘although’. 2) Which Illumina platform did you use? 3) Do you have quality control metrics for mitochondrial contamination? These can be used as an indicator of reduced cell viability during processing.
  
  Reviewer 2. Wei Zhang
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  No. The experimental procedures are generally clear; however, the data analysis requires further improvement. More detailed descriptions of the data processing and analysis steps are needed. For example, specific parameters used in Cell Ranger should be explicitly reported. Additional downstream processing using commonly adopted tools such as Seurat to generate cell clustering results would be beneficial. This would not only provide an extra layer of data quality assessment, but also facilitate data reuse by enabling users to work directly with processed datasets without the need to perform a full reanalysis.
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  No. The authors provide both biological and technical validations supporting the robustness of the dataset, and standard single-cell RNA-seq quality metrics indicate high technical quality. However, as noted above, additional downstream analyses could further characterize data quality, for example by estimating the proportion of doublets and assessing the fraction of cells with high mitochondrial gene content, among other commonly used metrics. The authors note that the data have been analyzed in a preprint manuscript; including more detailed analyses in the present manuscript would further strengthen the value of this data release.
  
  Is the validation suitable for this type of data?
  
  No. While the authors present solid biological and technical validations, and independent assays demonstrate hypoxia responses consistent with previous studies, the sequencing data represent the core contribution of this manuscript. Additional analyses leveraging the single-cell transcriptomic data to directly examine angiogenesis or endothelial expansion would further strengthen the validation and enhance the value of the dataset for reuse.
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data?
  
  No. Although the reuse potential is clearly articulated, providing more concrete details on the structure and contents of the deposited data—such as the number of samples and file types—would further facilitate data reuse and integration.
  
  Additional Comments:
  
  This manuscript describes a well-designed single-cell RNA-seq dataset generated from murine embryoid bodies differentiated under controlled hypoxic and normoxic conditions. The experimental procedures are generally clear and methodologically sound, and the dataset has potential value as a community resource. However, several issues should be addressed:
  
  1) More detailed downstream analyses would strengthen the manuscript and better demonstrate the utility and quality of the dataset. 2) The overall data size is relatively limited, comprising only four experimental conditions/time points, which may restrict its broader applicability. 3) The authors state that these sequencing data have already been used in a preprint manuscript by the same group. It is therefore unclear whether the dataset remains appropriate for publication in this journal as a standalone Data Release.
  
  Re-review: The revised version has largely addressed the previous concerns. The methods are described in greater detail, the data presentation is more comprehensive, and the overall quality of the manuscript has improved substantially. However, I have some concerns regarding the cell clustering shown in Figure 5. Several clusters (e.g., clusters 6, 8, and 10) appear to be strong outliers. It would be helpful to further examine these subpopulations, for example by assessing whether they are influenced by batch effects and whether batch-effect correction using appropriate software is necessary, although it is also possible that these subclusters are biologically meaningful. In addition, performing doublet detection and filtering prior to re-clustering the cells would likely be more appropriate, as some outlier subclusters (e.g., cluster 10) might disappear after these steps.
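  
  Both reviewers ask about the fraction of reads from mitochondrial genes as a quality metric. A minimal sketch of that calculation is given below, assuming the mouse convention that mitochondrial gene symbols start with "mt-". The counts, the cell names, and the 0.2 cut-off are invented for illustration and are not taken from the dataset or the authors' analysis.

```python
def mito_fraction(counts):
    """Fraction of a cell's total counts that come from mitochondrial genes."""
    total = sum(counts.values())
    mito = sum(n for gene, n in counts.items() if gene.lower().startswith("mt-"))
    return mito / total if total else 0.0

# Hypothetical per-cell gene counts (gene symbol -> UMI count).
cells = {
    "cell_A": {"mt-Co1": 5, "mt-Nd1": 5, "Pecam1": 40, "Kdr": 50},
    "cell_B": {"mt-Co1": 30, "mt-Nd1": 20, "Pecam1": 25, "Kdr": 25},
}

MAX_MITO = 0.2  # assumed cut-off; appropriate values depend on tissue and protocol
fractions = {name: mito_fraction(counts) for name, counts in cells.items()}
flagged = [name for name, frac in fractions.items() if frac > MAX_MITO]
print(fractions, flagged)
```

  In this toy example cell_A has 10% mitochondrial counts and passes, while cell_B has 50% and is flagged as a likely low-viability cell.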
Visit annotations in context

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.10.29.685315v1
Apr 2026
www.biorxiv.org

The genome of the reef-building coral Porites harrisoni from the southern Persian/Arabian Gulf

2
1. GigaScience 29 Apr 2026
  
  in GigaByte
  
  Editors Assessment:
  
  This data paper is a genome note presenting the assembly of Porites harrisoni, a stony coral species endemic to the thermally extreme southern Persian Gulf. Using long ONT PromethION nanopore reads, the final assembly encompassed 626.7 Mb across 1,883 contigs, achieving a BUSCO completeness of 86.3%. This revealed significant repeat content, comprising 59.23% of the nuclear genome and highlighting a diploid structure with predominant homozygosity. A total of 27,823 protein-coding genes were annotated from this assembly, facilitating discussions on thermal resilience under climate change. The research underscores the genomic framework supporting adaptive capacities in corals, with implications for evolutionary biology and conservation science, especially in the context of ongoing ocean warming.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 29 Apr 2026
  
  in GigaByte
  
  Abstract: We present a genome assembly from the coral species Porites harrisoni from the southern Persian/Arabian Gulf, the hottest ocean basin where corals live. The assembly is 626.7 Mb in size, spanning 1,883 contigs with a contig N50 of 807.4 kb, including a single-contig mitochondrial genome. The assembly has a BUSCO completeness of 86.3% (single = 72.5%, duplicated = 13.7%, fragmented = 1.2%, missing = 12.5%) using the eukaryota_odb10 reference set (n = 255). A total of 59.23% of the nuclear genome consists of repeats, comprising 15.89% retroelements, 10.00% DNA transposons, and 31.71% unclassified repeats. Gene annotation of this nuclear genome assembly identified 27,823 protein-coding genes. The mitogenome has an assembly size of 18,639 bp with 13 protein-coding genes as well as 2 tRNAs and 2 rRNAs. The genome of P. harrisoni provides a valuable genomic resource of a coral from an extreme environment, which will enable comparative analyses, enhancing our understanding of the genomic architecture underlying thermal resilience. Such comparisons will contribute to elucidating the evolutionary basis of heat tolerance and adaptive capacity of corals in the context of rapid climate change.
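  
  For readers less familiar with the contig N50 quoted in this abstract: it is the length of the contig at which, going from longest to shortest, the running total first reaches half of the assembly size. A minimal sketch follows; the contig lengths are invented for illustration and are not from the P. harrisoni assembly.

```python
def n50(lengths):
    """Length of the contig at which the cumulative sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")

toy_contigs = [100, 80, 60, 40, 20]  # hypothetical lengths, total 300
toy_n50 = n50(toy_contigs)
print(toy_n50)  # 100 + 80 = 180 >= 150, so this prints 80
```

  A higher N50 for the same assembly size means that more of the sequence sits in long contigs, which is why it is reported alongside the contig count.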
  
  This work has now been published in GigaByte under a CC-BY 4.0 license: https://doi.org/10.46471/gigabyte.174
  
  Reviewer 1. Oleg Simakov
  
  Overall, a very useful resource; the manuscript is clearly written and the data are consistently described. The genome assembly and annotation are well executed given the available data. Two minor suggestions: (1) include a sentence or two on potential genome assembly improvements (and any pitfalls that may be encountered), for example using Hi-C data and/or long-read (re)sequencing; (2) specify explicitly which "shorter reads" (Nanopore?) were used for ONT polishing in the assembly section, and their amount.
  
  Reviewer 2. Yue Song
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  Yes. The authors provide a standard and well-documented methodology: assembly, transposable-element annotation, and gene structural annotation all rely on widely used software and established pipelines. The parameter settings and post-assembly processing steps have been described.
  
  Is there sufficient data validation and statistical analyses of data quality?
  
  No. With respect to data quality, the authors note that the number of annotated protein-coding genes in Porites harrisoni is lower than that reported for other congeneric species, yet they offer no further discussion. This discrepancy is striking and warrants clarification: is it a biological reality reflecting gene loss or genome compaction in this species, or is it an artefact arising from differences in annotation pipelines, gene-model thresholds, or assembly completeness among studies? A concise comparative analysis, with explicit acknowledgment of methodological variables, would help readers properly interpret this genomic feature.
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data?
  
  No. Although the authors present a valuable and rare coral genome assembly, the manuscript appears to offer only basic genomic data. There is limited elaboration on the declared aim of illuminating the molecular basis of thermal tolerance. In particular, after the structural annotation of protein-coding genes, no systematic functional characterization (e.g., GO/KEGG enrichment, comparative analyses of heat-stress-related gene families, or symbiosis-related pathways) is provided. This section seems to have been undertaken but is neither described nor discussed in the current version.
  
  Additional Comments:
  
  (1) The quality of the figures could be further improved. Specifically, in Figure 1, the phylogenetic tree appears to be hand-drawn and lacks the polish typically seen in published phylogenetic analyses. It is recommended that the authors refer to examples from other studies for guidance on improving visual quality. Additionally, the tree currently lacks common indicators of phylogenetic robustness, such as bootstrap values or other support metrics.
  
  (2) In panel A (Figure 1), species highlighted in brown are presumably those included in this study. It would be helpful to add a legend clarifying the meaning of the different font colors to improve readability. Furthermore, the labeling format for sub-figures is inconsistent across the manuscript, for example “Figure 1A, B” in one instance and “Figure 2A, B” in another. Standardizing the labeling format throughout would enhance clarity and professionalism.
  
  (3) Line 273: There appears to be an error in the unit used for “average protein length.” If this value refers to the length of the encoded proteins, it should not be expressed in base pairs (bp). Please clarify the meaning and use the appropriate unit (e.g., amino acids).

Annotators

GigaScience

URL

biorxiv.org/content/10.64898/2026.02.26.708201v1
www.biorxiv.org

GEfetch2R: fetching single-cell/bulk RNA-seq data from public repositories to R and benchmarking the subsequent format conversion tools

2
1. GigaScience 28 Apr 2026
  
  in GigaScience
  
  Abstract

  Background: Downloading and reanalyzing the existing single-cell RNA sequencing (scRNA-seq) data provides an efficient choice to gain clues and new insights. However, no tool can fetch the diverse scRNA-seq data types (raw data, count matrix, and processed object) distributed in various repositories, process and load the downloaded data to R, convert formats between scRNA-seq objects, and benchmark the format conversion tools.

  Findings: Here, we present GEfetch2R, an R package with Docker image to (i) download diverse scRNA-seq data types, including raw data (SRA and ENA), count matrices (GEO, UCSC Cell Browser, and PanglaoDB), and processed objects (Zenodo, CELLxGENE, and HCA); (ii) process the downloaded data, load output/downloaded count matrices and annotations to R (SeuratObject/DESeqDataSet), filter the SeuratObject based on cell metadata and genes, and merge multiple SeuratObjects if applicable; (iii) convert formats between the widely used scRNA-seq objects, including SeuratObject, AnnData, SingleCellExperiment, CellDataSet/cell_data_set, and loom, and benchmark format conversion tools in terms of information kept, usability, running time, and scalability to guide the tool selection. Furthermore, GEfetch2R can also download, process, and load bulk RNA-seq raw data (SRA and ENA) and count matrices (GEO) to R (DESeqDataSet).

  Conclusions: GEfetch2R is an R package dedicated to facilitating researchers to access and explore the existing gene expression data from various public repositories. It can function as a data downloader (supports all three scRNA-seq and two bulk RNA-seq data types), a data processor (processes and loads the output/downloaded count matrices and annotations to R), and an object format converter (between the widely used scRNA-seq objects).
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag039), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  
  Reviewer 1:
  
  The manuscript presents GEfetch2R, an R package (with a Docker image) that fetches scRNA-seq and bulk RNA-seq data from multiple repositories, loads the data into R objects, and benchmarks format-conversion tools. The problem addressed is real and important; the implementation appears practical and well documented. I see strong potential for adoption.

  Major comments

  1) Robust cross-repository support for .RData files. While GEfetch2R lists rdata among supported extensions for Zenodo and HCA, many GEO submissions and other archives still provide processed data exclusively as .RData, often bundling matrices and metadata in heterogeneous objects. Please add an explicit, repository-agnostic .RData ingestion path with: (i) automatic object introspection, (ii) standardized extraction of matrices/metadata, (iii) graceful fallbacks with clear diagnostics for non-standard objects, and (iv) reproducible examples. This materially increases real-world coverage.

  2) Large-scale, automated evaluation on ~100 scRNA-seq datasets. Beyond the single COVID-19 application and the conversion benchmark, please include a systematic "fetch success-rate" study across ~100 GEO scRNA-seq datasets. Provide a Dockerized workflow (publicly available) that periodically attempts end-to-end retrieval (raw / count / processed) and reports success/failure rates stratified by repository and file type, with resource/time footprints and categorized failure causes. Given heterogeneous deposition practices, even ~50% overall success would be informative.

  3) Another very important point is to provide a Dockerfile together with the Docker image.

  Minor revisions
  
  "altas" → atlas (COVID-19 section title/caption).
  
  "Count maatrix" → Count matrix (Figure 3 caption/table column).
  
  "PanglanDB" → PanglaoDB (tables).
  
  Consistency: keep SeuratObject (not "Seurat object"); keep rds lowercase;
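
The "fetch success-rate" report requested in major comment 2 could be tabulated along the following lines. This is a minimal Python sketch under stated assumptions: the record fields (`repository`, `file_type`, `ok`, `failure_cause`) and the example attempt log are hypothetical and are not part of GEfetch2R.

```python
from collections import Counter, defaultdict


def summarize_fetch_attempts(attempts):
    """Tally success rates per (repository, file_type) and count failure causes."""
    totals = defaultdict(lambda: [0, 0])  # key -> [successes, attempts]
    causes = Counter()
    for a in attempts:
        key = (a["repository"], a["file_type"])
        totals[key][1] += 1
        if a["ok"]:
            totals[key][0] += 1
        else:
            causes[a.get("failure_cause", "unknown")] += 1
    rates = {key: ok / n for key, (ok, n) in totals.items()}
    return rates, causes


# Hypothetical attempt log, for illustration only.
attempts = [
    {"repository": "GEO", "file_type": "count", "ok": True},
    {"repository": "GEO", "file_type": "count", "ok": False,
     "failure_cause": "non-standard .RData"},
    {"repository": "SRA", "file_type": "raw", "ok": True},
    {"repository": "Zenodo", "file_type": "processed", "ok": False,
     "failure_cause": "timeout"},
]
rates, causes = summarize_fetch_attempts(attempts)
for (repo, ftype), rate in sorted(rates.items()):
    print(f"{repo}\t{ftype}\t{rate:.0%}")
```

Keeping failure causes in a separate counter makes it possible to report categorized failures alongside the stratified rates, as the comment asks.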

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.11.18.567507v2
www.biorxiv.org

HVRLocator: A Computationally Efficient Tool for Identifying Hypervariable Regions in 16S rRNA Big Datasets

2
1. GigaScience 28 Apr 2026
  
  in GigaScience
  
  Abstract

  Background: Amplicon sequencing of the 16S rRNA gene is widely used to assess microbial diversity due to its cost-effectiveness and efficiency. However, public 16S rRNA datasets often lack standardized metadata, particularly information on the sequenced hypervariable regions or primers used, which are critical for accurate analysis and data reuse. To address this, we present the HVRLocator, a computational tool that reliably identifies sequenced hypervariable regions, enhancing metadata quality and enabling more robust large-scale microbiome studies.

  Results: The HVRLocator tool processed samples at an average rate of 0.147 per minute. Validation confirmed 100% accuracy in predicting alignment positions, correctly matching sequences to the expected primer regions based on literature. We demonstrated how to use the tool to select appropriate and comparable sequences for building a global bacterial database from V4 region amplicons of the 16S rRNA gene. Using HVRLocator, we selected 36,217 valid samples out of 45,882 runs, enabling us to identify cases where metadata incorrectly labeled sequences as targeting the V4 region.

  Conclusion: Even when metadata is available, it can be inaccurate or misleading. HVRLocator offers a reliable and efficient method to identify the exact hypervariable sequenced region, ensuring accurate processing of large-scale 16S rRNA amplicon data. By bypassing inconsistent metadata and literature, it streamlines data curation and enhances the reliability of microbial studies, syntheses, and meta-analyses. Its use is essential for critically evaluating published data and enabling accurate and reproducible research in microbial ecology.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag040), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2:
  
  General Comments

  This manuscript introduces a tool named HVRLocator, designed to address the issue of missing or non-standard metadata in 16S rRNA sequencing data found in public databases such as the SRA. The tool identifies amplicon regions by aligning sequences to a reference genome and attempts to detect the presence of primers using a machine learning model. This is a subject with significant practical value, particularly for conducting large-scale meta-analyses. However, there are still many issues regarding methodological rigor, the depth of validation, and comparisons with existing tools that require further clarification by the authors.

  Major Comments

  1. Concerns regarding the singularity of the reference sequence
  The authors mention aligning sequences to a single Escherichia coli (J01859.1) reference genome to determine start and end positions. Is a single E. coli reference sufficient to cover Archaea or bacterial phyla that are distantly related to Proteobacteria, which may be present in environmental samples (e.g., soil, ocean)? For taxa with significant length variations or insertions/deletions (Indels), could forced alignment to the E. coli reference lead to misjudgment of start/end positions? Have the authors evaluated the impact on accuracy if a more universal reference database (such as representative sequences from SILVA or Greengenes) were used?

  2. Rationality of the primer detection model (Random Forest based on Quality Scores)
  The authors developed a Random Forest model to predict primer presence by analyzing the quality score distribution of the first 1,000 reads. Primer detection is typically based on the sequence itself rather than quality scores. Can the authors explain why quality scores were chosen as features? Sequencing quality scores are influenced by technical factors such as sequencer status, reagent batches, and run cycles, which have no direct biological correlation with the presence of primers. Is there a risk that this model is "overfitting" specific sequencing platforms or datasets? Since the reads are already downloaded, why not directly use degenerate primer sequence matching (e.g., using Cutadapt or SeqKit logic) to determine primer presence? This seems to be a more direct and accurate method.

  3. Verification of accuracy claims
  In the validation section, the authors claim to achieve 100% accuracy on certain datasets. In bioinformatics tool development, a claim of 100% accuracy is often a red flag. Have the authors manually checked those samples marked as "correct" by the model that might suffer from edge effects or borderline cases?

  4. Dataset imbalance in the Random Forest model
  For the Random Forest model, the authors used 882 samples with primers and 8,940 samples without primers for training. Such an extremely imbalanced dataset, even with stratified sampling, may cause the model to be biased towards the majority class.

  5. Comparison with existing tools
  The manuscript mentions that no tool has been designed for this specific purpose, but this may overlook some existing general-purpose tools or scripts. Many pipelines (such as certain plugins in QIIME 2, USEARCH, etc.) possess functionalities to identify primers or evaluate amplicon regions. The authors should discuss how their tool compares to these existing workflows.

  Minor Comments

  1. Confusion regarding processing speed metrics
  The abstract mentions a processing speed of "0.147 samples per minute", but later the text mentions "6.5 samples per minute" and "one sample every 0.147 minutes". There is confusion regarding units and values in these three descriptions (is it samples per minute or minutes per sample?). Please unify and correct these data to ensure consistency.

  2. Usage of fastq-dump
  The use of fastq-dump is mentioned. The SRA Toolkit's fastq-dump is relatively slow and has largely been superseded by fasterq-dump for efficiency. Why did the authors not use the more efficient fasterq-dump?

  3. Definition of "Standardized metadata"
  The term "standardized metadata" is used frequently. Please explicitly define what constitutes "standard" metadata in the context of this tool within the text.

  4. Robustness and error handling
  The results section mentions that some samples failed due to "NCBI portal-related issues". Does this imply the tool lacks breakpoint resumption or retry mechanisms? Given that network fluctuations are common during large-scale downloads, how is the tool's robustness demonstrated?

  5. Output confidence intervals
  The output file contains "TRUE/FALSE" and a probability score. For samples where the probability score is at a critical threshold (e.g., around 0.5), does the tool provide an "uncertain" tag, or does it force a classification? It is suggested to add an indicator for ambiguous ranges.
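
The sequence-based alternative raised in major comment 2 (matching degenerate primers directly against the reads) can be sketched as follows. This is a rough illustration in Python, not HVRLocator's method and not a substitute for Cutadapt: it performs exact matching only, with no mismatch tolerance, and the function names, the `max_offset` parameter and the example reads are assumptions made for this sketch. The primer shown is the widely used 515F V4 forward primer, chosen purely as an example.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a degenerate primer into a regex (exact matching, no mismatches)."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def fraction_with_primer(reads, primer, max_offset=5):
    """Fraction of reads containing the primer within the first few bases."""
    pattern = primer_regex(primer)
    window = len(primer) + max_offset  # allow a short leading spacer
    hits = sum(1 for read in reads if pattern.search(read.upper()[:window]))
    return hits / len(reads) if reads else 0.0


# Hypothetical reads: the first two start with the primer, the third does not.
reads = [
    "GTGCCAGCAGCCGCGGTAATACG",
    "GTGTCAGCCGCCGCGGTAAGG",
    "TACGGAGGGTGCAAGCG",
]
frac = fraction_with_primer(reads, "GTGYCAGCMGCCGCGGTAA")
print(f"fraction of reads with primer: {frac:.2f}")
```

A per-sample fraction like this could be reported next to the model's probability score, which would also give a natural way to flag ambiguous samples.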
  
  Reviewer 1:
  
  Metabarcoding data are accumulating rapidly. This paper makes a very valuable contribution to the automated extraction and curation of metabarcoding data and should be of great value in facilitating the re-use of existing data and the construction of custom databases based on these. I have not tested or tried to install the software myself, as the manuscript provided sufficient detail to enable me to assess the tool.
  
  General comments
  
  The manuscript is written entirely in terms of "bacteria" and aligns amplicons to an E. coli model sequence. This is reasonable, but there should certainly be some acknowledgement of Archaea and ideally some mention of Eukaryotes too. These are probably things for the discussion section of this manuscript, but the authors may wish to consider whether a future version of the program could contain options to use model Archaea and Eukaryote sequences as alternatives to the E. coli model. It would also be helpful to assess how the program with its E. coli model deals with sequence data from Archaea, Eukaryotes (including mitochondria) and bacteria that are very divergent from E. coli.

  The methods section does not contain details of the software used to generate the figures, or whether these figures are produced by "the pipeline" or by separate analysis of the .txt file that the pipeline produces. I suspect that it is the latter, in which case the authors should make the scripts used available: as well as providing complete documentation of what has been done, this is likely to increase use of the tool. It would also be helpful to include an output file in the supplementary materials.
  
  Specific comments
  
  Line 64 "however the integration of these data in light of processing metadata" - not clear
  
  Line 67-8 "though bacterial diversity increases linearly with amplicon length". Needs re-wording. The number of ASVs will increase with amplicon length, but the actual bacterial diversity in a sample is constant.
  
  Line 79 "Wasimuddin and colleagues" should be "Wasimuddin et al". More generally, check that citations conform with journal house style
  
  Line 79-82 "For example, Wasimuddin and colleagues [8] found that compared to three other primer sets targeting different regions, the primer pair targeting the V4 hypervariable region of the 16S rRNA gene produced the highest estimates of species richness and diversity across various sample types" There are three issues here: 1) Different primer pairs vary in their coverage and bias, so different primers targeting the same variable region will produce different numbers of ASVs. 2) Even with complete coverage and the absence of bias, different variable regions will generate different numbers of ASVs as a result of differences in length and rate of evolution between variable regions (and differences in the number of ASVs that are clustered into OTUs at a particular sequence similarity threshold). 3) The relationship between ASVs or OTUs and "species" is not straightforward (Edgar, 2018). At minimum "species" should be replaced with ASV or OTU (whichever Wasimuddin et al. used).
  
  Line 89-90 "as bacterial diversity and taxonomic resolution linearly increase with target sequence length [12]." Overlaps with statement made in line 67-8, and the same issue applies here.
  
  Lines 167-170. The output file contains (amongst other things) "Predicted HV region Start/End: Predicted hypervariable (HV) region based on the median alignment start and end positions across all reads, inferred from literature on conserved and hypervariable regions of the 16S rRNA gene (Brosius et al., 1978; Yang et al., 2016)". This implies that the program predicts a single variable region for each study. I am not clear what this column will contain for amplicons that contain more than one variable region, although columns 11-19 indicate that the program identifies the presence/absence of each of the 9 HV regions. My guess is that the authors are using "HV region" in two different senses: 1) its usual meaning of one region out of V1 to V9; 2) the sequence from the beginning of the first of the nine variable regions the amplicon includes to the end of the last. It would also be helpful to indicate whether the sequence positions here are relative to the E. coli model or refer to sequence positions in the amplicon.
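
The two senses of "HV region" distinguished in the comment on lines 167-170 can be made concrete with a small Python sketch: one output is the overall span (median start and end), the other is the list of individual V regions inside that span. The region boundaries below are approximate E. coli positions included only as illustrative placeholders; they are an assumption of this sketch and may differ from the boundaries HVRLocator uses. The function name and the example coordinates are likewise hypothetical.

```python
from statistics import median

# Approximate E. coli 16S positions for V1-V9, used here only as illustrative
# placeholders (an assumption; HVRLocator's own boundaries may differ).
V_REGIONS = {
    "V1": (69, 99), "V2": (137, 242), "V3": (433, 497),
    "V4": (576, 682), "V5": (822, 879), "V6": (986, 1043),
    "V7": (1117, 1173), "V8": (1243, 1294), "V9": (1435, 1465),
}


def covered_regions(starts, ends, regions=V_REGIONS):
    """Return the median alignment span and the V regions fully inside it."""
    span = (median(starts), median(ends))
    inside = [name for name, (start, end) in regions.items()
              if span[0] <= start and end <= span[1]]
    return span, inside


# Hypothetical per-read alignment coordinates on the E. coli reference.
span, inside = covered_regions([340, 341, 342], [805, 806, 807])
print("span:", span, "regions:", inside)
```

Reporting both values, and stating that the coordinates are relative to the reference rather than to the amplicon, would remove the ambiguity described above.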
  
  Edgar, R. C. (2018). Updating the 97% identity threshold for 16S ribosomal RNA OTUs. Bioinformatics, 34(14), 2371-2375. doi:10.1093/bioinformatics/bty113

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.07.24.666487v1
www.biorxiv.org

MicroFinder: Conserved gene-set mapping and assembly ordering for manual curation of bird microchromosomes

2
1. GigaScience 28 Apr 2026
  
  in GigaScience
  
  Abstract

  Obtaining chromosomally complete genome assemblies across the tree of life is a major goal of biodiversity genomics. However, some lineages remain recalcitrant to assembly despite recent advances in sequencing technologies and assembly tools. Birds present a substantial genome assembly challenge due to the presence of tiny, hard to assemble microchromosomes that are often highly fragmented or even missing in draft genome assemblies. As such, bird genomes require a large amount of expert manual curation effort via manipulation of genome-wide Hi-C contact maps and many current chromosome-level bird genome assemblies do not resolve the known karyotype. Microchromosomes have distinct genetic and epigenetic features. They are GC-biased, gene-rich, highly methylated, and have distinct spatial organisation in the centre of the nucleus. Importantly, they are conserved across avian evolution. Here, using a reference set of expert curated bird genomes, we have identified a set of conserved microchromosome genes and developed MicroFinder, a pipeline that uses this gene set to find small microchromosome fragments in draft genome assemblies to act as anchors for manual curation of microchromosomes. We demonstrate how MicroFinder can be used to improve the speed and accuracy of bird genome curation. Furthermore, we highlight the usefulness of MicroFinder by carrying out MicroFinder-enabled re-curation of 12 previously released chromosome-scale bird genome assemblies, increasing the sequence content of microchromosome models.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag036), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2:
  
  I had the privilege of reviewing the manuscript titled "MicroFinder: conserved gene-set mapping and assembly ordering for manual insertion of bird microchromosomes" by Mathers et al. The manuscript presents a conserved gene set linked to bird microchromosomes for identifying putative contigs/scaffolds. Subsequently, microchromosome contigs/scaffolds can be made into their corresponding chromosome models using orthogonal evidence from Hi-C data. MicroFinder utilises the current knowledge of microchromosome conservation across birds. This approach is similar to the assembly evaluation method using BUSCO genes.
  
  One of the major limitations of the manuscript is the lack of validation or supportive evidence to show that manual curation results after applying MicroFinder hints are valid and robust. The authors can perform local synteny or chromosome-scale alignment analyses and conservation property evaluation to demonstrate that the results of assembly curation are valid. The authors can also report metrics of Hi-C contact maps before and after curation for inter- and intra-chromosome contacts to demonstrate improvements. If this is not done, the authors may have to remove the results and methods corresponding to manual curation so as to focus on genes that are found in "putative" microchromosomes.
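
  One simple before/after Hi-C metric of the kind suggested above is the fraction of contacts whose two ends fall within the same chromosome model. The Python sketch below is an illustration under stated assumptions: the function name, the input layout (scaffold pairs with contact counts) and the example values are hypothetical and are not taken from MicroFinder or from any particular Hi-C toolkit.

```python
def cis_fraction(contacts, scaffold_to_chrom):
    """Fraction of Hi-C contacts whose two ends lie in the same chromosome model.

    contacts: iterable of (scaffold_a, scaffold_b, count)
    scaffold_to_chrom: dict mapping scaffold name -> chromosome model; scaffolds
    absent from the mapping are treated as unplaced, each as its own unit.
    """
    cis = total = 0
    for a, b, n in contacts:
        chrom_a = scaffold_to_chrom.get(a, a)
        chrom_b = scaffold_to_chrom.get(b, b)
        total += n
        if chrom_a == chrom_b:
            cis += n
    return cis / total if total else 0.0


# Hypothetical contact counts between three scaffolds.
contacts = [("s1", "s1", 80), ("s1", "s2", 15), ("s2", "s2", 40), ("s1", "s3", 5)]

# Before curation s2 is unplaced; after curation it joins the same model as s1.
before = cis_fraction(contacts, {"s1": "chr30", "s3": "chr1"})
after = cis_fraction(contacts, {"s1": "chr30", "s2": "chr30", "s3": "chr1"})
print(f"cis fraction before: {before:.3f}, after: {after:.3f}")
```

  An increase in this fraction after curation is supportive rather than conclusive, so it would sit alongside the synteny and alignment evidence requested above rather than replace it.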
  
  Manuscript is generally well written with some minor concerns. Analyses presented are generally robust.
  
  It was confusing to read the difference between micro and dot chromosomes. I encourage authors to avoid "dot" chromosome term. Although it has been used in literature in the past, we can do without that term. There is no strong evidence to suggest if micro and dot chromosomes have any significant functional or system level differences. Best to avoid the term.
  
  If authors insist on using the dot nomenclature, a justification and explanation would be required with clear definitions for both. Also, the name of the workflow may need to change as well. I leave it up to authors to make that call.
  
  Similarly, I encourage the authors to avoid using the term "shrapnel" for small unplaced contigs. Just use "small unplaced contigs" instead.

  The Findings section contains a lot of information that belongs in the Methods section, for example lines 109-117, 122-125, 135-137, 154-156, 160-164, 167-172 and 187-192. Please revise the text so that the Findings section does not contain any methods description.

  A definition of what an orthogroup and a fuzzy orthogroup are is required.
  
  Result/findings section needs significant improvements. Authors have relegated much of the results to tables in supplementary information. I insist that authors summarise those results in a meaningful descriptive way and refer to supplementary information for extra details.
  
  Lines 176-177 mention the manual curation of microchromosomes. I would like to see the rules and decisions that were employed to join, break or reorder contigs/scaffolds into a chromosome model.

  The authors mention that 216 kb-4.3 Mb of additional content per assembly was added. This is incorrect, as the sequence content was already present in the assembly; it has just been reorganised into microchromosome scaffolds. Please correct the text to say that unplaced scaffolds are organised into putative microchromosomes.

  Lines 108-199 mention errors in the original assembly. A description of the types of errors is required.
  
  Authors should discuss the property of eagles, falcons and parrots with rearranged/fused micro chromosomes. The proposed method may not be effective in such instances.
  
  The authors suggest the use of a 5 Mbp cutoff. However, this cutoff may miss instances where a microchromosome is incorrectly placed with a macrochromosome. The authors discuss this as a paralog or misalignment related issue. I suggest that the authors provide a metric for the success/failure of identifying genes, similar to BUSCO. The authors can run the software on all available bird genomes to define the properties of such a metric for each gene. The Results section can explain the proportions of the 9400 found on macro- vs microchromosomes, and the proportions of the 14k fuzzy genes on micro- vs macrochromosomes and their copy status. 9400 + 14514 does not add up to 16,589 orthogroups; something is not clearly described about those numbers. Please improve the text to make meaningful assessments of conserved gene sets on microchromosomes for it to be useful for the research community.
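
  A BUSCO-like placement summary of the kind requested above could be computed roughly as follows. This Python sketch is an illustration only: the category names, the input layout (gene hits per scaffold plus scaffold lengths) and the example data are assumptions made here and are not part of MicroFinder. It classifies each conserved gene by whether its hits fall on scaffolds below or above the 5 Mbp cutoff.

```python
MICRO_CUTOFF_BP = 5_000_000  # the 5 Mbp cutoff discussed in the review


def placement_summary(gene_hits, scaffold_lengths, cutoff=MICRO_CUTOFF_BP):
    """Summarize where conserved genes land relative to a scaffold size cutoff.

    gene_hits: dict gene -> list of scaffolds with a hit (empty list = missing)
    scaffold_lengths: dict scaffold -> length in bp
    Returns the proportion of genes in each category.
    """
    counts = {"micro_only": 0, "macro_only": 0, "both": 0, "missing": 0}
    for scaffolds in gene_hits.values():
        small = any(scaffold_lengths[s] < cutoff for s in scaffolds)
        large = any(scaffold_lengths[s] >= cutoff for s in scaffolds)
        if small and large:
            counts["both"] += 1
        elif small:
            counts["micro_only"] += 1
        elif large:
            counts["macro_only"] += 1
        else:
            counts["missing"] += 1
    total = len(gene_hits)
    return {k: v / total for k, v in counts.items()} if total else counts


# Hypothetical scaffold lengths and gene hits, for illustration only.
lengths = {"scaf_1": 120_000_000, "scaf_30": 2_100_000}
hits = {
    "geneA": ["scaf_30"],
    "geneB": ["scaf_1"],
    "geneC": ["scaf_30", "scaf_1"],
    "geneD": [],
}
summary = placement_summary(hits, lengths)
print(summary)
```

  Genes in the "both" category are the ones where paralogs, misalignments or a microchromosome joined to a macrochromosome would need to be told apart, which is the case the cutoff alone cannot resolve.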
  
  Methods: Lines 233-234: what is taxon in this context? Please clarify. There is also a mention of taxa with missing data. What data were missing? Please clarify.
  
  Lines 236-237: do the authors mean the chromosomes identified by the submitter of the primary assembly? Please clarify.
  
  For each species, the authors should also refer to the RefSeq version of the assembly for posterity. Common names of species may be useful too for a broad readership.
  
  Line 254: please modify the section header to remove the assembly versions, as they are not useful.
  
  Methods describing the orthogroup clustering should include details about how alignments were filtered and processed. This is currently missing.
  
  The significance of the phylogenetic analyses in the context of the manuscript is not very clear. Maybe remove that section. Perhaps the authors can utilise phylogenetic distance as a way to discuss how conserved gene sets behave between species.
  
  Results section can include run time and compute resource usage metrics for others to estimate resource requirements for such analyses.
  
  Updated assemblies can be submitted to NCBI. Authors should consider this.
2. GigaScience 28 Apr 2026
  
  in GigaScience
  
  Abstract: Obtaining chromosomally complete genome assemblies across the tree of life is a major goal of biodiversity genomics. However, some lineages remain recalcitrant to assembly despite recent advances in sequencing technologies and assembly tools. Birds present a substantial genome assembly challenge due to the presence of tiny, hard to assemble microchromosomes that are often highly fragmented or even missing in draft genome assemblies. As such, bird genomes require a large amount of expert manual curation effort via manipulation of genome-wide Hi-C contact maps and many current chromosome-level bird genome assemblies do not resolve the known karyotype. Microchromosomes have distinct genetic and epigenetic features. They are GC-biased, gene-rich, highly methylated, and have distinct spatial organisation in the centre of the nucleus. Importantly, they are conserved across avian evolution. Here, using a reference set of expert curated bird genomes, we have identified a set of conserved microchromosome genes and developed MicroFinder, a pipeline that uses this gene set to find small microchromosome fragments in draft genome assemblies to act as anchors for manual curation of microchromosomes. We demonstrate how MicroFinder can be used to improve the speed and accuracy of bird genome curation. Furthermore, we highlight the usefulness of MicroFinder by carrying out MicroFinder-enabled re-curation of 12 previously released chromosome-scale bird genome assemblies, increasing the sequence content of microchromosome models.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag036), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1:
  
  I am very happy to see that MicroFinder is going to be published! Last year I used it very often to curate bird assemblies. I found no major issues, only minor ones.
  
  The only crucial (but still technical) issue is that your protein dataset is from dot microchromosomes, i.e. not from all microchromosomes. So I highly recommend using "dot microchromosomes" where relevant, including in the title of the manuscript.
  
  Minor issues:
  
  row 19 (Abstract background) change "major goal" to a softer statement. Generation of the assemblies is a very important task of biodiversity genomics but not a major one
  
  row 54-55 Do you imply that a typical bird genome contains 37-41 chromosome pairs? There are a lot of birds with a lower number of chromosomes, so I am not sure that it is typical. Also, a reference to a publication from 1981 looks outdated
  
  row 109 - why were only eleven assemblies selected?
  
  row 111 - 112 Please, highlight how many orders/families were not covered
  
  rows 129 - 137 These lines somewhat contradict the rest of the text, including the abstract. Your dataset is focused on dot chromosomes and not on all microchromosomes. I suggest replacing "microchromosomes" with "dot microchromosomes" nearly everywhere, including the title
  
  row 173 - 185 I am very skeptical about extending the results obtained on a single genome assembly to the whole family, especially considering that your dataset covers less than half of bird orders. My experience with MicroFinder is that it sometimes selects contigs/scaffolds belonging to macrochromosomes. However, there are not many and they are usually short. Please soften the statements
  
  row 429 Reference 13 is in French and doesn't have an English translation of the title
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.05.09.653066v1
www.biorxiv.org www.biorxiv.org

Patterns of aDNA Damage Through Time and Environments – lessons from herbarium specimens

3
1. GigaScience 10 Apr 2026
  
  in GigaScience
  
  Abstract: Herbarium collections are a vast but underutilized resource for ancient DNA research, containing over 400 million specimens with detailed metadata and spanning centuries of global biodiversity. Understanding patterns of DNA preservation in natural collections is crucial for optimizing ancient DNA studies and informing future curation practices. We analysed genomic data for 573 herbarium specimens from six plant species from the genera Hordeum and Oryza collected from the Americas and Eurasia over 220 years. Using standardized laboratory protocols and shotgun sequencing, we quantified DNA degradation and elucidated factors that accelerate it. We find significant age-dependent DNA fragmentation rates, indicating temporal degradation processes not detected in prehistoric samples. In our analysis, DNA decay rates in herbarium specimens were almost eight times faster than in moa bones, reflecting fundamental differences in tissue composition and preservation environments. Environmental conditions at the time of specimen collection emerged as the major determinants of post-mortem damage rates, with the interaction term between temperature and genus being the dominant driver of cytosine deamination. We find no effect of sample storage on DNA damage and degradation. These findings provide insights into how climatic origin, preservation environment, taxonomic identity and age influence DNA preservation while highlighting opportunities for improving institutional preservation practices. Due to standardised preservation conditions, museum collections can provide better insights into DNA damage and degradation over time than archaeological and paleontological samples.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag026), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3:
  
  I read this work with great interest, and I believe it represents an excellent contribution to our understanding of aDNA preservation, particularly welcome for plants, since most studies in this field are usually carried out on animal tissues, bones, and similar materials. The authors show that ancient DNA (aDNA) damage in herbarium specimens results from a combination of temporal, environmental, and biological factors, with storage conditions affecting decay rates. Their results indicate that DNA fragmentation increases in dry plant tissue with sample age, that it varies between genera, and that temperature is the main driver of cytosine deamination. I agree with these interpretations, but the discussion could place more emphasis on the roles of water and oxidation in DNA degradation. Rapid drying of herbarium specimens limits hydrolytic damage but may increase oxidative processes; in contrast, animal or arthropod specimens dry more slowly, and this allows different degradation dynamics. Considering these differences in the discussion can further clarify the mechanisms behind the observed patterns, especially across museum tissue types.
  
  The methodologies in the study are solid. The approach used to estimate endogenous DNA content is appropriate, though applying a mapping quality threshold could strengthen the calculation. The methods for assessing DNA fragmentation, DNA damage, decay rates, and 5' C→T substitutions seem robust and optimal for validating aDNA authenticity. The climate analyses also appear sound, but I cannot provide a detailed evaluation of this part due to limited expertise in this area.
  
  The explanation for the correlation between fragment length and sample age seems logical. Unlike animals, where DNA decay occurs in two phases, plant tissue death is instead gradual and diverse depending on tissue, and this allows enzymatic and microbial degradation to continue over longer periods, contributing to the strong age-fragmentation relationship. Overall, the study highlights the importance of tissue type and storage conditions on DNA decay; however, discussing how hydrolytic and oxidative processes differ between herbarium plants and other specimen types (animal) would further strengthen the interpretation of the decay rates.
  
  Specific comments
  
  The terminology related to ancient DNA preservation (e.g., DNA damage, DNA degradation, DNA decay) should be clarified and used more consistently throughout the text. These terms describe distinct processes, and specifying the intended meaning for each will improve precision and avoid confusion for the reader. DNA damage refers to specific chemical lesions; DNA degradation describes the physical fragmentation of DNA molecules; and DNA decay refers to the temporal process or rate at which DNA deteriorates over time.
  
  The two most prominent reactions associated with DNA degradation are deamination (resulting with spontaneous substitutions of cytosine residues to uracil) and depurination (breakage of the phosphodiester bond resulting in DNA backbone fragmentation). In view of the comment above on the terminology used, I believe that the sentence above conflates different processes: deamination is a form of DNA damage, whereas depurination leads to DNA degradation through strand fragmentation. I suggest the terminology in the paper should be modified to reflect this distinction. Even if the authors do not wish to adopt this terminology I suggest that they clarify the terms more clearly at the beginning.
  
  Line 106: …six plant species, spanning…
  
  Line 98, 105: In this context, it is not appropriate to refer to deamination-induced substitutions as "mutations," since they represent post-mortem chemical damage rather than random biological changes (mutations) that occur in vivo. In addition, introducing this new term further complicates the terminology discussed in the previous comments.
  
  Line 116-118: I wonder if the sampling coverage for Hordeum, with highest counts in arid and warm regions, may be incomplete, as certain regions, such as northern Europe (e.g., Scandinavia) or Russia are not represented. These species are cultivated in Russia, Denmark, southern Sweden, I believe. Should this limitation be acknowledged as it could affect the generality of the conclusions especially regarding temperatures?
  
  It is unclear why the study included only wild Oryza species (O. alta, O. grandiglumis, O. latifolia, O. rufipogon), whereas for Hordeum the cultivated Hordeum vulgare was used. Perhaps, including Oryza sativa can provide more information on DNA preservation in domesticated material and allow a more consistent comparison across genera?
  
  Table 1: Draw a line above the last row (Total)
  
  Line 140: Oryza should be in italics
  
  Line 140: why 58 Oryza (30 O. latifolia, 18 O. rufipogon and 10 O. grandiglumis)? Why not all Oryza samples?
  
  From line 169, it appears that an additional 287 Oryza samples from different origins (KAUST) were used, but it is not clear (not explained) if these are herbarium specimens, and why this origin (KAUST) is not included in Table 1. Perhaps it would be better to explain at the beginning of this paragraph that there are two subsets of samples and to clarify the content of Table 1 more clearly.
  
  Line 143: it is not specified which part of the herbarium material was used. I assume leaves, but this should be clearly stated
  
  Line 149: Please clarify what "gDNA" refers to; genomic DNA? Since you spell out "genomic DNA" elsewhere in the paragraph, the abbreviation here seems unnecessary.
  
  Line 149: Why was only a subset used? Please explain and provide a rationale.
  
  Line 154: were the libraries constructed only on this subset as well?
  
  Line 162: Fragment size: The first letter of the sentence should be capitalized.
  
  Lines 165-169: It is not clear to me how the different subsets of samples were used in this study. Here it is stated that all barley samples (but how many exactly?) were sequenced on NovaSeq in a specific place, whereas only 40 rice samples (from the initial subset of how many?) were sequenced on another NovaSeq platform and at a different institute. Also, the 287 samples from KAUST were sequenced on a MiSeq, which has lower output compared to NovaSeq. Somehow, it is necessary to explain how the initial 573 samples were selected and used for all analyses. Also, the 287 samples from KAUST were processed in an ancient DNA lab, but what about all the other samples? It would be strange if a specialized laboratory for ancient DNA analyses was not used for all samples. In this regard, it should also be noted that the issue of contamination is not mentioned in the manuscript, although it was certainly considered by the authors; for example, by indicating whether negative controls (blank samples) were used and how they were processed. Certainly, the C>T signal ensures that we are dealing with authentic ancient sequences, but this should be highlighted and explained more clearly.
  
  Line 189: Why was it aligned to Oryza glumipatula (a new species not mentioned before?) and not against Oryza rufipogon? The authors report measuring gDNA fragment size distributions on a subset of 40 samples. It would be helpful if they could provide a motivation for why this subset was chosen, and how it is representative of the full dataset, to clarify the rationale behind not analyzing all samples.
2. GigaScience 10 Apr 2026
  
  in GigaScience
  
  Reviewer 2:
  
  Reproducibility report for: Patterns of aDNA Damage Through Time and Environments - lessons from herbarium specimens
  
  Journal: GigaScience
  
  ID number/DOI: GIGA-D-25-00447
  
  Reviewer(s): Laura Caquelin, Department of Clinical Neuroscience, Karolinska Institutet, Sweden [Wrote the report and reproduced the results]; Gustav Nilsonne, Department of Clinical Neuroscience, Karolinska Institutet, Sweden [Reviewed the final report]
  
  Summary of the study
  
  The authors evaluated DNA preservation in herbarium collections by analyzing genomic data from 573 specimens of Hordeum and Oryza. They quantified DNA degradation and identified factors affecting decay, finding that specimen age and environmental conditions strongly influence DNA preservation.
  
  Scope of reproducibility
  
  According to our assessment the primary objective is: the regression analyses of aDNA damage metrics for Hordeum and Oryza.
  
  Outcome: "Four metrics were selected to quantify patterns of aDNA damage: (i) the proportion of endogenous DNA content, (ii) the fragment length distribution, (iii) the damage fraction per site (λ), and (iv) the frequencies of 5' C>T substitutions." (lines 197-199)
  
  Analysis method outcome: "The four metrics were analysed in linear models as a function of collection year and sample age using the 'lm' function in R" (lines 199-200)
  
  Main result: The results of this outcome are presented in figure 2 "Regression analyses of aDNA damage metrics for Hordeum and Oryza" and in the related text lines 302 to 361 in the "Regression analysis" section: "Endogenous fraction […] The regression analyses revealed no statistically significant relationship between the proportion of endogenous DNA and the sample collection year in Hordeum (R² = 0.003, p = 0.451, N = 211), but a very weak yet significant relationship was observed in Oryza (R² = 0.04, p = 0.00167, N = 245; figure 2a).
  
  Fragment length […] We observed a statistically significant relationship between the log-mean fragment length and the sample collection year for both genera (figure 2b), with a stronger relationship for Hordeum (R² = 0.27, p = 5.33 × 10⁻¹⁶, N = 211) than Oryza (R² = 0.112, p = 8.58 × 10⁻⁸, N = 245).
  
  Damage fraction per site (λ) and DNA decay rate (k) […] We estimated the DNA decay rate per year (k) for Hordeum and Oryza from the slope of the linear relationship between λ and sample age (figure 2c). We observed a per nucleotide decay rate of k = 2.64 × 10⁻⁴ per year for Hordeum (R² = 0.208, p = 3.27 × 10⁻¹², N = 211), which was 1.5 times faster than the decay rate of Oryza of k = 1.79 × 10⁻⁴ per year (R² = 0.101, p = 3.65 × 10⁻⁷, N = 245) […].
  
  Nucleotide misincorporations […] (figure 2d), with Oryza starting from a higher baseline of damage when compared to Hordeum and displaying a stronger relationship (R² = 0.303, p = 8.62 × 10⁻²¹, N = 245 for Oryza, and R² = 0.207, p = 3.63 × 10⁻¹², N = 211 for Hordeum, respectively). […]"
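  
  As a hedged illustration of the analysis method quoted above (a linear model of a damage metric against sample age, with the decay rate k taken as the slope of λ against age), the following Python sketch fits an ordinary least-squares line to synthetic values. The data, the function name `ols_fit` and all variable names are assumptions made for illustration only; the authors' own analysis uses the `lm` function in R.

```python
def ols_fit(x, y):
    """Return (slope, intercept, r_squared) of an ordinary least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear fit, R^2 equals the squared Pearson correlation.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


# Synthetic, noise-free toy data (NOT data from the study).
ages = [10, 50, 100, 150, 200]          # sample age in years
k_true = 2.64e-4                        # quoted Hordeum rate, used only to build the toy data
lambdas = [0.01 + k_true * a for a in ages]

# The slope of lambda against age is the per-year decay rate k.
k_est, intercept, r2 = ols_fit(ages, lambdas)

# Ratio of the two decay rates quoted in the manuscript text.
ratio = 2.64e-4 / 1.79e-4
```

  With real measurements the fit would not be exact. The ratio of the two quoted rates is about 1.47, which rounds to the 1.5 stated in the quoted text.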
  
  Availability of Materials
  
  a. Data
  
  Data availability: Raw data are not yet publicly available but have been uploaded to the NCBI database. Processed data are shared via the private journal dropbox, and the intermediate file is available on the GitHub repository.
  
  Data completeness: Complete processed data and intermediate file (all data necessary to reproduce main results are available).
  
  Access Method: Private journal dropbox and GitHub repository
  
  Repository: https://github.com/Stefano-Porrelli/Herbaria_aDNA_Damage
  
  Data quality: Structured
  
  b. Code
  
  Code availability: Open
  
  Programming Language(s): R and Bash
  
  Repository link: https://github.com/Stefano-Porrelli/Herbaria_aDNA_Damage
  
  License: MIT license
  
  Repository status: Public
  
  Documentation: Clear Readme file. Additional details may be required to run the Bash code.
  
  Computational environment of reproduction analysis
  
  Operating system for reproduction: MacOS 15.7.2
  
  Programming Language(s): R
  
  Code implementation approach: Using shared code
  
  Version environment for reproduction: R version 4.5.1/RStudio 2025.05.1
  
  Results
  
  5.1 Original study results - Results 1: See screenshot figure 2:
  
  5.2 Steps for reproduction
  
  -> Run 01_Plant_aDNA_screening_prep.sh
  
  - Issue 1: The reviewer link provided for the bioprojects on NCBI did not allow downloading.
  
  -- Partial resolution: An email was sent to the authors requesting access to the raw data or sharing processed data and intermediate files. Processed data were shared via the private journal dropbox and intermediate file (aDNA_damage_screening_MAIN.txt) was shared both on the dropbox and the GitHub repository.
  
  The authors contacted NCBI to enable downloading the raw data with the reviewer link, but no response has yet been received. As the review needed to be performed within a set timeframe, the computational reproducibility review was performed first using the processed data and then directly with the intermediate file.
  
  Note: The two bash scripts were not run. Additional guidelines would be helpful for running these scripts, especially regarding terminal commands and manual steps (changing the repository name or the link to the data for example).
  
  -> Run the analysis from the processed data shared
  
  --> Run code aDNA_Dmg_Script00_collate_screening_results.r
  
  - Issue 2: The code expects data organized in two sub-folders: 4_mapping and 5_aDNA_characteristics. Processed data were received in several species-specific folders, each containing 4_mapping and 5_aDNA_characteristics.
  
  -- Resolved: All data were merged manually into single 4_mapping and 5_aDNA_characteristics folders to match the script's requirements. This detail should be added to the readme file.
  
  - Issue 3: The sample_metadata.txt file was not correctly merged with the results dataframe. Many columns (Batch_no to X) in aDNA_damage_screening_MAIN.txt contained NA values.
  
  -- Resolved: A message was sent to the authors to resolve the issue. They updated both sample_metadata.txt and aDNA_damage_screening_MAIN.txt on GitHub.
  
  Author's response: "I have realised the problem stems from inconsistencies between sample naming conventions in the screening output directories and the sample identifiers in the metadata file. Specifically, for the Hordeum samples, the directories are named using library IDs rather than the short sample names, and some of the Oryza samples were missing their expected suffixes. This meant the left_join step failed to match metadata for those samples. Thank you for flagging this up. I have now corrected this by updating the "Sample" column in the metadata file to reflect the actual directory names used in the screening outputs. The original short names are preserved in a "Sample_ID" column. I have uploaded the corrected sample_metadata.txt file to the GitHub repository, and also updated the aDNA_damage_screening_MAIN.txt dataset on the GitHub repo to reflect these changes. I have re-run the pipeline and it now works correctly. Please let me know if you encounter any further issues, and thank you again for catching this."
  
  The reproduced aDNA_damage_screening_MAIN.txt file no longer contains NA values.
  
  --> Run code aDNA_Dmg_Script02_Regressions.r: The script was run without any issues.
  
  -> Run the analysis from the intermediate data file shared on Github --> Run code aDNA_Dmg_Script02_Regressions.r: Run the code after renaming the file to aDNA_damage_screening_MAIN_shared.txt.
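  
  The merge failure described under Issue 3 is the kind of problem that can be caught before analysis by listing the sample keys that fail to match. The sketch below is a minimal Python illustration with invented sample names and values; it is not the authors' R pipeline, and `left_join_report` is a hypothetical helper introduced only for this example.

```python
def left_join_report(results, metadata, key="Sample"):
    """Left-join metadata rows onto result rows and list the unmatched keys."""
    meta_by_key = {row[key]: row for row in metadata}
    joined, unmatched = [], []
    for row in results:
        match = meta_by_key.get(row[key])
        if match is None:
            # No metadata for this sample: its metadata columns stay missing,
            # which is what shows up as NA values after a left join.
            unmatched.append(row[key])
            joined.append(dict(row))
        else:
            joined.append({**match, **row})
    return joined, unmatched


# Invented names: results are keyed by directory name, metadata by short name.
results = [
    {"Sample": "LIB_0001", "lambda": 0.03},
    {"Sample": "OR12_a", "lambda": 0.05},
]
metadata = [
    {"Sample": "HV1", "Year": 1900},
    {"Sample": "OR12_a", "Year": 1950},
]

joined, unmatched = left_join_report(results, metadata)
```

  Here `unmatched` contains the one key that has no metadata row, so the naming mismatch is reported explicitly instead of surfacing later as missing values.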
  
  5.3 Statistical comparison Original vs Reproduced results
  
  - Reproduced results:
  
  -- Using the processed data and the reproduced aDNA_damage_screening_MAIN.txt, the results of Figure 2 were successfully reproduced (see screenshots below).
  
  -- Using the shared aDNA_damage_screening_MAIN.txt from GitHub, the results were also successfully reproduced (see screenshots below).
  
  Comments: Supplementary Figure 1 was also reproduced using the same code. We confirmed that the reproduced values match the original results. Both the processed data and the intermediate data file reproduced Supplementary Figure 1 (see screenshots below).
  
  Errors detected: One reporting error was detected in the "Fragment length" section (line 336): the p-value for Oryza should be 8.47 × 10⁻⁸, not 8.58 × 10⁻⁸ as reported in the text.
  
  Statistical Consistency: All statistical results reproduced from both the processed data and the intermediate file are identical to those reported in the manuscript (see Comparison_reproduced_vs_original.csv and Comparison_two_reproductions.csv files attached with this report).
  
  Conclusion
  
  Summary of the computational reproducibility review
  
  The computational reproducibility review shows that the results in Figure 2 and related text of the original study were fully reproducible using both the processed data and the intermediate data file shared (aDNA_damage_screening_MAIN.txt). The statistical results reproduced are identical to those presented in the manuscript. One minor reporting error was detected in the manuscript: the p-value for Oryza in the "Fragment length" section should be 8.47 × 10⁻⁸ instead of 8.58 × 10⁻⁸.
  
  Recommendations for authors
  
  -- Provide clearer instructions for running the Bash scripts, including terminal commands and any manual steps.
  
  -- Ensure consistent sample naming across metadata files and data directories to avoid merging issues for all analysis/scripts.
  
  -- Consider making raw data publicly available or provide clear guidance for reviewers to access it.
  
  -- Maintain clear documentation of file structure to facilitate future reproducibility.
3. GigaScience 10 Apr 2026
  
  in GigaScience
  
  Reviewer 1:
  
  The manuscript by Stefano Porrelli and colleagues makes a valuable contribution by scaling up previous work on DNA damage in plant herbarium specimens and by exploring how collection environments influence patterns of aDNA degradation. The authors present a large-scale analysis of DNA damage in 573 specimens from six Hordeum and Oryza species spanning ~220 years and diverse climates. Using standardized ancient DNA protocols, shotgun sequencing, and high-resolution climate data, they model the effects of specimen age, collection environment, genus, and herbarium of origin on DNA fragmentation, decay rates, and cytosine deamination.
  
  The study robustly confirms that DNA fragmentation and λ are strongly age-dependent, that herbarium specimens exhibit decay rates intermediate between bones and arthropods, and that environmental factors (particularly temperature) appear to correlate with 5′ C→T damage when all samples are analysed together. At the same time, some aspects of the temperature interpretation, especially in relation to genus-level structure, merit further clarification (as detailed below). Storage conditions (herbarium identity) seem to have comparatively minor influence.
  
  Overall, I enjoyed reading this research: the dataset is rich, the methodological framework is strong, and the work has significant potential to become a reference for understanding plant aDNA preservation in herbaria. I believe the paper merits publication, though several concerns should be addressed prior to its acceptance. Please find below several points that I hope will help strengthen and refine the manuscript.
  
  Major comments
  
  Definition and calculation of endogenous DNA fraction
  
  You define endogenous fraction as "the percentage of post-quality trimmed and merged reads for each sample mapped to its respective reference" (lines 203-206) and say it "was calculated with SAMtools 'flagstat'" (line 206). However, this is somewhat ambiguous:
  
  Is the denominator the number of merged reads after AdapterRemoval, the total raw reads, or only non-duplicate mapped reads?
  
  Do you include secondary/supplementary alignments (multi mappers), and how are PCR duplicates treated here?
  
  Given that endogenous fraction is one of your four key metrics (Methods, lines 197-200), it would be useful to make this completely explicit.
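  
  To illustrate why the choice of denominator matters, the Python sketch below computes an endogenous fraction from one set of read counts under three conventions. All counts and the function name `endogenous_fraction` are hypothetical; they are not taken from the manuscript or from SAMtools output.

```python
def endogenous_fraction(mapped, denominator):
    """Percentage of reads counted as endogenous for a chosen denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * mapped / denominator


# Hypothetical counts for a single library (assumed values).
raw_reads = 1_000_000      # reads before trimming and merging
merged_reads = 800_000     # reads after trimming and merging
mapped_primary = 200_000   # primary alignments, duplicates included
mapped_dedup = 150_000     # primary alignments after duplicate removal

fractions = {
    "mapped / raw": endogenous_fraction(mapped_primary, raw_reads),
    "mapped / merged": endogenous_fraction(mapped_primary, merged_reads),
    "deduplicated / merged": endogenous_fraction(mapped_dedup, merged_reads),
}
```

  With these assumed counts the same library yields 20%, 25% or 18.75% depending on the convention, which is why an explicit definition is useful.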
  
  Need a better explanation of the "month of collection" variable
  
  Lines 266-273: you state that monthly temperature and precipitation were extracted "to infer climatic conditions at the time of specimen collection" and that in the collection climate model variables were assigned "based on their location and month of collection." Later, in the Results you again refer to "collection climate" and "annual climate" models (lines ~438-441).
  
  However, it is not entirely clear whether month is explicitly included as a variable (e.g. as a categorical factor or via the corresponding monthly raster) or whether you simply used the CHELSA monthly layer corresponding to the recorded month. Please clarify in the Methods how the month of collection enters the model. Is there a variable "month" per se, or is the only effect that you choose the relevant tas_XX and pr_XX layer?
  
  This would make it much easier for readers to follow how "month" is used and what the collection climate actually represents.
  
  Need a clarification of "Collection Climate" vs. Herbarium Storage
  
  In the Methods (lines ~271-274), you describe a collection climate model where "monthly climatic variables (temperature and precipitation) were assigned to samples based on their location and month of collection," and an annual climate model based on annual means at the collection location. However, it is not clearly stated how this model relates to the actual time each specimen spent in the field vs. in herbarium storage. By definition, a 150-year-old specimen will have spent the majority of its lifetime in a collection, yet the climate used in the models is that of the collection locality at the time of sampling, not the climate of the herbarium building where it spent decades, despite the herbarium being included as a factor.
  
  Could you please clarify explicitly what period of a specimen's "life after death" you intend to capture with the collection climate model? Is it mainly the drying/early post-mortem period, or are you also considering longer-term storage conditions in the herbarium? Do you assume that most deamination and oxidative damage occur in the first days to months after collection, and that later storage in relatively stable herbarium conditions contributes little to further degradation?
  
  Need for the integration of non-deamination mismatch controls and baseline divergence
  
  Your analysis focuses on the aDNA-typical 5′ C→T misincorporations (Methods, lines 238-245; Results, lines 355-361). However, you do not show any other mismatch frequencies (e.g. A→G, G→A etc) as a "negative control" to demonstrate that the patterns you report (exponential decay, climate, age, genus effects) are specific to deamination rather than general elevation of error rates or mapping artefacts.
  
  On that specific point, Lines 622-624 and 651-653: You attribute the higher 5′ C>T frequencies in Oryza to greater susceptibility to post-mortem deamination, potentially linked to its tropical and sub-tropical distribution. However, because Oryza originates from consistently warmer regions while Hordeum is predominantly temperate, genus and temperature are strongly confounded in your dataset. This is also supported by your own variance partitioning analysis, where large shared variance fractions (temperature × genus) indicate that these two predictors are difficult to disentangle.
  
  Furthermore, Figure 6 shows that when analysing each genus separately, the relationship between either annual mean temperature or collection temperature and 5′ C>T frequencies is no longer significant. This suggests that the global temperature-damage correlation you report is largely driven by genus-level differences rather than by temperature acting independently, or am I wrong? Otherwise, could you add a short discussion of this point, explaining why, if temperature does have an impact on deamination, we do not see it within each genus across different temperature values?
  
  While I agree that environmental conditions at the time of collection may influence DNA degradation, another factor that could contribute to the observed genus-specific patterns is reference-read divergence. Indeed, a recent unreviewed work (see preprint: https://doi.org/10.1101/2025.07.16.665190) showed that the percentage identity between the reference genome and the ancient reads can influence apparent damage estimates. Although divergence between the ancient Hordeum/Oryza reads and their respective references is unlikely to be extreme given that plants do not evolve as rapidly as microbial taxa, a sanity check (e.g., adding the average percentage identity of each species per genus to the model) would help confirm that reference mismatch is not inflating differences in estimated 5′ C>T frequencies between genera.
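  The negative control requested at the start of this comment (comparing 5′ C→T against other substitution classes) could be tabulated roughly as follows. This is only a minimal sketch: the input format (pre-aligned, gap-free reference/read string pairs) and the function name are assumptions made for illustration, not part of the manuscript's pipeline, and the example pairs are made up.

```python
from collections import Counter

def five_prime_substitution_freqs(pairs, pos=0):
    """Per-substitution frequency at one read position.

    pairs: iterable of (ref, read) strings that are already aligned,
           gap-free and oriented 5'->3' (hypothetical input format).
    Returns {(ref_base, read_base): mismatches / occurrences of ref_base}.
    """
    ref_totals = Counter()
    subs = Counter()
    for ref, read in pairs:
        if len(ref) <= pos or len(read) <= pos:
            continue  # read too short for this position
        r, q = ref[pos].upper(), read[pos].upper()
        if r not in "ACGT" or q not in "ACGT":
            continue  # skip ambiguous bases
        ref_totals[r] += 1
        if r != q:
            subs[(r, q)] += 1
    return {key: n / ref_totals[key[0]] for key, n in subs.items()}

# Made-up toy alignments: four reads starting on a reference C, four on an A.
pairs = [("CAGT", "TAGT"), ("CAGT", "CAGT"), ("CTTA", "TTTA"), ("CGGA", "CGGA"),
         ("ATTC", "GTTC"), ("ATTC", "ATTC"), ("ATTC", "ATTC"), ("ATTC", "ATTC")]
freqs = five_prime_substitution_freqs(pairs)
for (ref_base, read_base), f in sorted(freqs.items()):
    print(f"{ref_base}->{read_base}: {f:.2f}")
```

  If C→T stands clearly above every other class at the 5′ end while the other classes stay flat across samples, that supports a deamination-specific signal rather than a general elevation of error rates.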
  
  Minor comments
  
  Title : "Patterns of aDNA Damage Through Time end Environments" → "Time and Environments."
  
  Line 95 - ex situ in italic.
  
  Line 140 and elsewhere: Oryza should be in italics whenever used as a genus (same for Hordeum).
  
  Line ~551: "extremally well-preserved samples" → "extremely well-preserved samples."
  
  It may help to add one sentence acknowledging that classical laboratory negative controls (blank extractions) are not relevant to the regression models, but that misincorporation spectra and MapDamage profiles effectively serve as authenticity checks (Methods, lines 176-187 and 238-247).
  
  Discussion lines 641-648 compare herbarium specimens to bones and arthropods. It might help the reader if you add one explicit sentence summarizing why age-fragmentation relationships are detectable in herbaria but not in bones (standardized post-collection environment, as you nicely explain in lines 595-603).
  
  In Figure 6, consider adding a brief note in the legend stating that the strong relationship in panels a-b is largely driven by contrasting climates and baseline damage between genera, and that it disappears within genera (c-d). This would remind readers of the confounding you discuss in the text.
  
  In the Methods you state that you used linear models (lm) for regressions and varpart + rda for variance partitioning (lines 197-201 and 269-281). While the overall approach is reasonable, it would help to briefly address whether model assumptions (normality, homoscedasticity) were checked for the linear regressions (e.g. on log-transformed variables).
  
  While the manuscript mentions storage effects in the discussion, it doesn't explore them in great detail. More focus on specific herbarium storage methods (e.g., temperature, humidity control) might help contextualize the minor storage effects observed. A brief section or discussion on institutional preservation practices and their variability could provide readers with more context about herbarium differences.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.10.26.684600v2
www.biorxiv.org www.biorxiv.org

Characterising a species-rich and understudied tropical insect fauna using DNA barcoding

3
1. GigaScience 10 Apr 2026
  
  in GigaScience
  
  Abstract. Background: West Africa has high biodiversity that is relatively understudied, especially for insects. Studies of West African arthropod diversity can therefore help address important questions regarding conservation, ecosystem services, and insecticide use and other species-control interventions in agriculture and disease management. We intensively sampled arthropods in Ghana using complementary trapping methods, generated DNA barcodes, and classified sequences by Barcode Index Numbers (BINs, a species proxy). Using this dataset, we investigate assemblage composition, temporal activity patterns, and the state of regional biodiversity sampling. Results: Sequencing DNA from 95,996 individuals captured using Malaise, yellow pan, pitfall, Heath and Centre for Disease Control (CDC) traps, we identified 10,120 unique BINs. The rate of species accumulation did not approach an asymptote for any taxonomic group or trap type, indicating high biodiversity. The different trap types sampled different subsets of the local community, with greatest similarity between yellow pan and pitfall traps. More insects and species (BINs) were trapped during the day than at night. Our dataset shared more BINs in the Barcode of Life Database with South Africa than with any other country, although this predominantly reflects the limited sampling and DNA sequencing campaigns in Africa. Conclusions: This study more than doubles the published BINs for West Africa, offering insights into the biodiversity of an ecologically important but understudied taxon and region. Using multiple trap types allowed a more complete assessment of the local arthropod assemblage. The public release of these data will support and stimulate further taxonomic and ecological work in the region.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag028), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3:
  
  This paper describes a massive DNA barcoding project of arthropods in Ghana, West Africa with a dataset of 95,996 individuals and 10,120 BINs (Barcode Index Numbers). The publication is a major contribution to characterizing biodiversity of tropical insects in a poorly studied area, answering methodological questions concerning trap complementarity and temporal activity, and is also an invaluable resource to the public. The research is well structured, analyses are favorable and the manuscript is well written. I recommend acceptance after minor revisions to address a few clarifications and technical points.
  
  The manuscript acknowledges that only a subset of individuals was sequenced due to logistical constraints, and for Heath traps, selection was based on wet mass. While the authors argue that sub-sorting aimed to maximize diversity, this could still introduce biases in abundance estimates and BIN accumulation curves. Please include a brief discussion of how this sub-sampling might affect the conclusions (e.g., richness estimates, trap comparisons) and consider adding a sensitivity analysis in the supplement if feasible.
  
  The finding that South Africa shares the most BINs with Ghana despite geographic distance is interesting and attributed to sampling effort. However, the regression model explains only 3% of variance (R2=0.03), suggesting other factors may be at play. Please discuss potential biogeographic or ecological reasons (e.g., similar habitats, historical connectivity) that might contribute to this pattern, even if sampling effort is the dominant driver.
  
  The use of BINs as a species proxy is appropriate for this study, but the manuscript should briefly acknowledge known limitations (e.g., BINs may over- or under-split species, particularly in poorly studied taxa). A sentence or two in the Discussion would suffice, noting that BINs are a pragmatic tool for biodiversity assessment but not a replacement for formal taxonomy.
  
  Line 381: "insFect" should be "insect".
  
  Table 1 and Table 2 are well-presented, but consider adding a footnote explaining that "BINs unique to trap type" means not found in other trap types in this study.
  
  Line 140: Specify the soap concentration used in pan and pitfall traps.
  
  Line 150: Clarify how "wet mass" was measured (precision, handling protocol).
  
  Line 156: Mention the success rate of PCR and sequencing (how many samples failed?).
  
  Line 360-379: The section on "Taxa of potential human importance" is interesting but could be strengthened by relating findings to local agricultural or health contexts. For example, what do the low numbers of crop pests or disease vectors imply for local management?
  
  Line 390-396: The conclusion could briefly highlight future directions, e.g., integrating morphological taxonomy with BINs, or using this dataset for metabarcoding studies.
  
  Line 228: "Neuroptera had the lowest completeness at 13.5%" - mention the sample size for this order.
  
  Line 302: "β = -1.92, p >0.05" - report the exact p-value.
2. GigaScience 10 Apr 2026
  
  in GigaScience
  
  Reviewer 2:
  
  General Comments:
  
  This manuscript presents an impressive and highly valuable study that significantly advances our understanding of tropical arthropod diversity in West Africa. The sampling effort is extraordinary (nearly 100,000 individuals sequenced), and the dataset generated more than doubles the number of Barcode Index Numbers (BINs) publicly available for the region. The study is well-designed, employing multiple complementary trap types to capture diverse components of the arthropod community. The analyses are generally robust and appropriate for the research questions. The public release of this large dataset is a major contribution that will undoubtedly stimulate further taxonomic and ecological research in understudied tropical regions. The manuscript is clearly written and well-structured. I am generally in favour of acceptance after minor revisions.
  
  Specific Comments and Suggestions for Revision:
  
  1. Visual Documentation of Methods. The manuscript would benefit from including representative photographs of each of the five trap types (Malaise, yellow pan, pitfall, Heath, CDC) as deployed in the field. This is particularly helpful for readers less familiar with entomological methods. Given potential space constraints in the main text, I recommend including these as a Supplementary Figure (e.g., a panel of five photos with concise captions). Please cite this figure in the Methods (Sampling) section.
  
  2. Robustness of Community Composition Analyses. The NMDS and PERMANOVA results convincingly show differences among trap types. However, the sequencing effort (and thus sample size) varied greatly among traps (e.g., Heath: 65,293 samples vs. CDC: 3,039 samples). Could the authors please clarify if the Bray-Curtis dissimilarity matrices used in these analyses were calculated on standardized or rarefied data to account for this large disparity in sample size? A brief note in the Methods (Data analyses) or figure legend would assure readers that the observed patterns are not primarily an artefact of sampling intensity.
  
  The finding of significantly higher diurnal catches (individuals and BINs) in Malaise traps is interesting. The discussion briefly mentions variance in thermal conditions. Could the authors expand the Discussion (Diurnal activity patterns) to include other potential ecological or methodological explanations? For example, might this reflect true peaks in flight activity for dominant taxa (Diptera, Hymenoptera), or could it be influenced by trap visibility or wind patterns differing between day and night? A sentence or two of speculation would enrich the interpretation.
  
  The authors transparently note that only 34 of 117 Malaise lots were fully sequenced and that spiders were removed from some analyses. In the Discussion, please add a short statement evaluating how these practical limitations might have influenced the key conclusions regarding trap complementarity and overall community completeness. For instance, does the high rate of BIN accumulation in Malaise traps (Supplementary Figure 6) suggest that sequencing the remaining lots might have yielded many additional unique BINs, potentially altering the estimated contribution of this trap type?
  
  3. Minor Editorial and Clarity Points:
  
  Line 381: There is a typo: "more insFect individuals" should be "more insect individuals".
  
  Figure 2 & 3 Citations in Text: The in-text citations for Figures 2 and 3 (e.g., lines 239, 274-277) are currently embedded in the legend descriptions copied from the PDF. These should be simplified to standard figure calls (e.g., "(Figure 2)", "(Figure 3A, B)") and the legend text removed from the main manuscript body.
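  To make the standardization question in point 2 concrete, the sketch below rarefies the larger sample to the depth of the smaller one before computing Bray-Curtis dissimilarity. It is an illustration only: the count vectors are made up, the function names are hypothetical, and nothing here is taken from the authors' analysis code.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample a per-BIN count vector to `depth` individuals without replacement."""
    pool = [i for i, n in enumerate(counts) for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of individuals")
    picked = random.Random(seed).sample(pool, depth)
    out = [0] * len(counts)
    for i in picked:
        out[i] += 1
    return out

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total

# Made-up counts per BIN for a heavily and a lightly sequenced trap type.
heavy = [500, 300, 150, 50]
light = [20, 10, 5, 5]
depth = min(sum(heavy), sum(light))

d_raw = bray_curtis(heavy, light)                  # dominated by the size gap
d_rare = bray_curtis(rarefy(heavy, depth), light)  # compared at equal depth
print(round(d_raw, 3), round(d_rare, 3))
```

  On raw counts the dissimilarity is driven largely by the difference in total individuals; after rarefaction it reflects composition at equal depth, which is the distinction the comment asks the authors to document.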
3. GigaScience 10 Apr 2026
  
  in GigaScience
  
  Reviewer 1:
  
  The manuscript "Characterising a species-rich and understudied tropical insect fauna using DNA Barcoding" by Hemprich-Bennett and co-authors provides DNA barcodes from 95,996 individuals sampled in Ghana using various trap systems. In total, 10,120 unique BINs were identified, including 4,939 that were newly generated. Most sampled taxa were Diptera, Coleoptera, and Lepidoptera. In addition, the authors compared the determined BINs with already published data at BOLD, revealing the greatest overlap in BIN sharing with South Africa. In my eyes, the topic of this manuscript is interesting and suitable for publication in "GigaScience", which focuses on "big data" research. The amount of new sequence data for arthropods, in particular insects, is awesome and represents an important step to assess the (molecular) biodiversity, or better species diversity, of a super diverse region which has hardly been studied so far. The authors use state-of-the-art methods to analyze their data including the BOLD database and BIN approach. However, there are some points that should be added or discussed in a broader context (see below). In addition, please find some specific comments made via sticky notes on the PDF file of the manuscript.
  
  I feel that the authors should provide some more references on various topics, especially in the introduction, but in the discussion too.
  
  It would be nice to present some maps, photos of the collection sites, the sampling devices as well as the samples themselves as part of the main manuscript, documenting the efforts that were taken.
  
  A BIN does not per se represent a species, because the variability of the DNA barcode fragment and mitochondrial DNA in general can be affected by various effects, e.g., incomplete lineage sorting, Wolbachia infections (especially true for arthropods), phylogeographic events, hybridization, and others. As a consequence, BIN sharing and splitting can be observed, and in fact such effects are found more often than expected. It is fully clear that such an analysis cannot be done for the given dataset, but a discussion of these effects is important and has been lacking thus far.
  
  What happened with the vouchers and DNA extracts? It is obvious that the collected specimens will include a high number of undescribed species, therefore the deposition of the voucher specimens is highly important.
  
  In my eyes it would be interesting to provide a summary of the lengths of the barcodes that were studied. How many barcodes were complete with a length of 658 base pairs? How many were about 300 bp etc.? I think such analysis can be easily done and visualized.
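  The length summary suggested above could be produced with a few lines like the following. This is a sketch under stated assumptions: the class boundaries other than the 658 bp full-length barcode are illustrative choices, and the function names and example lengths are hypothetical.

```python
from collections import Counter

FULL_LENGTH = 658  # full-length barcode as mentioned in the comment above

def length_class(n):
    """Assign a barcode length to a coarse class (illustrative thresholds)."""
    if n >= FULL_LENGTH:
        return "full (>=658 bp)"
    if n >= 500:
        return "500-657 bp"
    if n >= 300:
        return "300-499 bp"
    return "<300 bp"

def summarise_lengths(lengths):
    """Count how many barcodes fall into each length class."""
    return dict(Counter(length_class(n) for n in lengths))

# Made-up example lengths in base pairs.
summary = summarise_lengths([658, 658, 620, 310, 120])
print(summary)
```

  The resulting counts can be reported as a small table or a histogram per trap type or order.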
  
  Please find some other specific suggestions for corrections or additions made via notes on the document file of the manuscript.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.09.30.675770v1
www.biorxiv.org www.biorxiv.org

MADRe: Strain-Level Metagenomic Classification Through Assembly-Driven Database Reduction

2
1. GigaScience 10 Apr 2026
  
  in GigaScience
  
  Abstract. Strain-level metagenomic classification is essential for understanding microbial diversity and functional potential, but remains challenging, particularly in the absence of prior knowledge about the composition of the sample. In this paper we present MADRe, a modular and scalable pipeline for long-read strain-level metagenomic classification, enhanced with Metagenome Assembly-Driven Database Reduction. MADRe combines long-read metagenome assembly, contig-to-reference mapping reassignment based on an expectation-maximization algorithm for database reduction, and probabilistic read mapping reassignment to achieve sensitive and precise classification. We extensively evaluated MADRe on simulated datasets, mock communities, and a real anaerobic digester sludge metagenome, demonstrating that it consistently outperforms existing tools by achieving higher precision with reduced false positives. MADRe’s design allows users to apply either the database reduction or read classification step individually. Using only the read classification step shows results on par with other tested tools. MADRe is open source and publicly available at https://github.com/lbcb-sci/MADRe.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag030), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2:
  
  This manuscript presents MADRe, a modular pipeline for strain-level metagenomic classification from long-read data, emphasizing an assembly-driven database reduction strategy coupled with probabilistic reassignment. The work is methodologically sound and well aligned with the scope of GigaScience. However, the study could benefit from the following revisions:
  
  1. The study's main contribution is engineering and integration, rather than a fundamentally new statistical model. The authors should therefore state this explicitly in the Abstract as well as in the Discussion.
  
  2. Although the comparisons are reasonable, the manuscript could do more to clarify how MADRe compares against state-of-the-art strain-resolved tools under identical parameter tuning, and whether performance gains are consistent across different strain divergence levels.
  
  3. When comparing with existing tools, improvements appear primarily in precision, while recall trade-offs are less emphasized. The authors should explicitly discuss precision-recall trade-offs and clarify in which biological scenarios MADRe is most advantageous.
  
  4. While database reduction is presented as efficient, the computational cost of assembly plus EM iterations is not deeply analyzed. The authors should include a concise runtime/memory comparison or at least a qualitative discussion of computational trade-offs.
  
  5. The approach implicitly assumes that metagenome assembly is sufficiently accurate and representative. However, in highly complex or low-coverage samples, assembly could be fragmented or biased. The authors should add a clearer discussion on the sensitivity to assembler choice and parameters.
2. GigaScience 10 Apr 2026
  
  in GigaScience
  
  Reviewer 1:
  
  I have no significant concerns with the MADRe methodology, and the current datasets provide sufficient evidence of its strain-level performance. However, several issues still need to be addressed.
  
  The response states: "However, we observed a limitation when Centrifuger cannot confidently assign a read to a specific reference sequence (for example, when multiple chromosomes belong to the same strain). In such cases, it often classifies the read under the NCBI strain-level taxid, which in some instances is identical to the species-level taxid. This makes it impossible to directly and fairly compare those classifications with other tools that operate at the sequence level."
  
  Although I agree this issue may not substantially affect the overall conclusions, the current handling of strain-level evaluation for Centrifuger is not sufficiently rigorous. The underlying problem is that Centrifuger (and Kraken2) rely on nodes.dmp and names.dmp, where the lowest taxonomic rank is often species or subspecies. As a result, these tools cannot report strain-level abundances directly in their standard output. A more appropriate solution would be to assign custom, unique strain-level taxIDs for all reference genomes, allowing proper classification at the strain level. This approach has been discussed in https://github.com/mourisl/centrifuger/issues/18 and https://github.com/jenniferlu717/Bracken/issues/113. Additionally, Centrifuger has an extra program, centrifuger-quant, that uses the EM algorithm to estimate abundance. The read assignment results produced by Centrifuger do not apply the EM algorithm.
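  The idea of custom, unique strain-level taxIDs can be sketched as follows: each reference genome receives a new ID whose parent is its species taxid, and rows are written in the tab-pipe layout of the NCBI taxonomy dump. This is a hedged illustration, not a tested recipe for either tool: the function name, the starting ID, the genome labels and the strain names are all hypothetical, the node rows carry only the first three columns (the full NCBI `nodes.dmp` has more), and whether a given classifier accepts such abbreviated rows should be checked against its documentation and the linked issues.

```python
def build_strain_taxonomy(genomes, first_id=10_000_000):
    """Assign one new taxid per reference genome.

    genomes: list of (genome_label, strain_name, species_taxid) tuples.
    first_id: assumed to lie above any taxid already used in the database.
    Returns (nodes_rows, names_rows, label_to_taxid).
    """
    nodes, names, label_to_taxid = [], [], {}
    for offset, (label, strain_name, species_taxid) in enumerate(genomes):
        taxid = first_id + offset
        label_to_taxid[label] = taxid
        # Abbreviated node row: tax_id | parent tax_id | rank |
        nodes.append(f"{taxid}\t|\t{species_taxid}\t|\tstrain\t|")
        # Name row: tax_id | name | unique name (empty) | name class |
        names.append(f"{taxid}\t|\t{strain_name}\t|\t\t|\tscientific name\t|")
    return nodes, names, label_to_taxid

# Hypothetical genomes; 562 is the NCBI species taxid of Escherichia coli.
genomes = [("GENOME_A", "Escherichia coli hypothetical strain A", 562),
           ("GENOME_B", "Escherichia coli hypothetical strain B", 562)]
nodes, names, mapping = build_strain_taxonomy(genomes)
print("\n".join(nodes))
```

  With every reference genome carrying its own leaf ID, reads can no longer fall back to an ID that coincides with the species level, which is the comparability problem described in the quoted response.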
  
  In the similarity experiment, some strains exhibit extremely high similarity, which makes proportional read distribution practically impossible for MADRe. To better characterize the performance limits of MADRe for accurate strain classification and abundance estimation, I recommend including additional simple synthetic mixtures at different combinations of similarity and coverage depth. Because long reads vary widely in length, read counts alone can be misleading. I strongly encourage reporting strain abundances rather than raw read counts, as abundances are more relevant for downstream applications. Finally, the authors should clarify whether MADRe's limitation in detecting low-abundance strains (referring more to low coverage) is entirely determined by the performance of the assembly tool, or whether additional factors influence this limitation.
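  To illustrate why read counts can mislead for long reads, the sketch below converts read assignments into relative abundances by summing assigned bases per strain, dividing by genome length to obtain a depth, and normalizing. It is one common way of defining abundance and is given as an assumption for illustration; it is not MADRe's output format, and the strain labels, read lengths and genome lengths are made up.

```python
from collections import Counter, defaultdict

def strain_abundance(assignments, genome_lengths):
    """Relative abundance from (strain, read_length) assignments.

    Depth per strain = assigned bases / genome length; abundances are
    depths normalized to sum to 1.
    """
    bases = defaultdict(int)
    for strain, read_len in assignments:
        bases[strain] += read_len
    depth = {s: b / genome_lengths[s] for s, b in bases.items()}
    total = sum(depth.values())
    return {s: d / total for s, d in depth.items()}

# Made-up example: strain A has few long reads, strain B many short ones.
assignments = [("A", 10_000)] * 2 + [("B", 5_000)] * 4
genome_lengths = {"A": 1_000_000, "B": 2_000_000}

read_counts = Counter(s for s, _ in assignments)
abundance = strain_abundance(assignments, genome_lengths)
print(dict(read_counts), abundance)
```

  In this toy case strain B has twice as many reads as strain A, yet strain A is about twice as abundant once read length and genome size are taken into account.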
  
  In Figure 4, please specify the sequencing technology used for sim_high. "calculated usin fastANI" → "calculated using fastANI".

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.05.12.653324v2
www.biorxiv.org www.biorxiv.org

Sex Chromosome Turnover and Structural Interspecific Genome Divergence Shapes Meiotic Outcomes in Hybridizing Cobitis

2
1. GigaScience 10 Apr 2026
  
  in GigaScience
  
  Abstract. It has been empirically established that genome mixing between divergent species can trigger meiotic aberrations, ultimately leading to the emergence of asexual reproduction through the production of unreduced gametes in various metazoan lineages. Yet, it remains poorly understood how such asexual hybrids cope with co-inherited differences in sex determination systems, diverged regulatory networks, and chromosomal incompatibilities, especially in the context of increased ploidy. Addressing these questions requires high-quality, chromosome-level reference genomes of the parental species involved in hybrid formation. Here, we present the first chromosome-level genome assemblies for three hybridizing Cobitis species (C. elongatoides, C. taenia, and C. tanaitica), providing a comprehensive framework to investigate the genetic and cytogenetic basis of hybrid sterility and the transition to asexuality. By integrating genome scaffolding, male/female pooled sequencing, and molecular cytogenetics, we uncover extensive structural variation among homologous chromosomes of the three species, despite their overall syntenic conservation. Population-level Pool-Seq analyses further revealed that each species possesses a distinct, non-homologous sex chromosome, highlighting sex chromosome turnover even among recently diverged lineages. These assemblies enabled the design of chromosome-specific painting probes, which we applied to meiotic metaphase I spreads of diploid hybrids. This approach revealed striking differences in the pairing success of orthologous chromosomes, with some (e.g., Ch01B) frequently forming bivalents, while others (e.g., Ch01A, Ch05, Ch20) failed to do so and remained unpaired. Our results demonstrate that chromosome-specific features, shaped by structural evolution and sex-linked divergence, contribute unequally to hybrid meiotic failure. Together, this work provides a high-resolution genomic and cytogenetic framework to understand how interspecific hybridization gives rise to clonality, and how the architecture of inherited parental genomes shapes the success or breakdown of meiosis in hybrid vertebrates.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag031), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2:
  
  This study presents the first chromosome-level genome assemblies for three hybridising Cobitis species (C. elongatoides, C. taenia, and C. tanaitica) to investigate the genomic and cytogenetic basis of hybrid sterility and the transition to asexuality. The authors provide a large amount of integrated data, including genome scaffolding, male/female pooled sequencing (Pool-Seq), and molecular cytogenetics, and found extensive structural variation among homologous chromosomes of the three species, despite overall karyotype conservation. Population-level Pool-Seq analyses further revealed that each species possesses distinct, non-homologous sex chromosomes. Overall, the analyses are comprehensive and the results are solid, making the work suitable for this journal. I have only several minor concerns.
  
  1. The Background is too long, with many short paragraphs; it could be shortened to 4-5 paragraphs.
  
  2. Methods: there is no Ethics statement, please add one.
  
  3. Table 1 should be moved to the supplementary files.
  
  4. Figure 4 is not easy to see.
2. GigaScience 10 Apr 2026
  
  in GigaScience
  
  Reviewer 1:
  
  The authors assembled the genomes of three Cobitis species native to Eurasia in an attempt to investigate the effects of structural variants on hybrid meiotic failure. This is certainly an interesting topic given the advances in our abilities to study hybridization that have been enabled by modern genomic sequencing methods, and the evolutionary consequences of asexually-reproducing species that result from rare instances of these hybrid events.
  
  Major comments: The introduction of the manuscript is well-written and focused on the topic at hand. Language was mostly clear throughout the manuscript. However, the paper overall is very lengthy and would benefit from extensive revision. Personally, I think the assembly and annotation of the three genomes is worthy of being a paper (genome report) on its own. Extraction of this material into a separate manuscript would allow the authors to hone the remainder of the paper into a much more concise and focused manuscript. Some aspects of the methods section related to genome assembly and annotation could be clarified and/or bolstered. Presentation of methods is mostly clear, but the description of genome annotation methods is a bit tough to follow. This procedure included many complicated steps and may benefit from a flow chart, even if included only as a supplemental figure.
  
  Several important quality control steps pertaining to genome assembly and DNA/RNA sequence processing were not mentioned. Authors do not report methods used for quality filtering or trimming. They do not report any process for removal of sequencing adapters. Additionally, they do not report screening of the genome assemblies for contamination from other species. These are critical steps in producing high-quality genome assemblies that need to be addressed.
  
  Presentation of statistics describing genome assembly quality, contiguity, and completeness could be improved. Authors might want to take some inspiration from statistics required for reporting in genome reports published by other journals, such as G3 or Genome Biology and Evolution. Sequencing depth is not reported in any context for the initial assemblies. Only log-transformed values are available in a single figure. Throughout the manuscript, authors conflate sequencing coverage (the proportion of a genome or genomic region that has been sequenced) with sequencing depth (the number of times a base or genomic region has been sequenced).
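
  The distinction the reviewer draws between coverage and depth can be illustrated with a minimal sketch. It assumes a hypothetical list of per-base read depths (for example, parsed from a depth report); the function name and the example values are illustrative and do not come from the manuscript.

```python
def coverage_and_depth(per_base_depth):
    """Return (breadth_of_coverage, mean_depth) for a list of per-base read depths.

    breadth_of_coverage: fraction of positions covered by at least one read.
    mean_depth: average number of reads per position, uncovered positions included.
    """
    if not per_base_depth:
        raise ValueError("per_base_depth must not be empty")
    covered = sum(1 for depth in per_base_depth if depth > 0)
    breadth = covered / len(per_base_depth)
    mean_depth = sum(per_base_depth) / len(per_base_depth)
    return breadth, mean_depth


# Hypothetical 10-bp region: 8 of 10 bases are covered, each by 5 reads.
depths = [0, 0, 5, 5, 5, 5, 5, 5, 5, 5]
breadth, mean_depth = coverage_and_depth(depths)
print(f"breadth of coverage = {breadth:.2f}, mean depth = {mean_depth:.1f}x")
```

  The two numbers answer different questions: breadth says how much of the region was seen at all, while mean depth says how many times, on average, each base was read.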
  
  For the sex-linked primers designed by the authors - I would recommend development of an internal positive control that would be expected to amplify in both sexes and be easily distinguishable from the sex-linked locus by size or fluorescent label. This allows the users to distinguish between failed PCRs and identification of the homogametic sex. This is especially important because the fish selected for marker development were collected from a relatively small portion of the species' distributions (Figure 1) so there could be population-specific differences that affect reliability of these markers for identifying sex. This is a problem I regularly encounter in my own work for wide-ranging species.
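
  The logic behind the suggested internal positive control can be sketched as a small decision function. This is only an illustration: it assumes a marker that amplifies solely in the heterogametic sex, and the function name, arguments, and labels are hypothetical rather than taken from the manuscript.

```python
def call_sex(control_band, sex_linked_band, heterogametic="male"):
    """Interpret a duplex PCR with an internal positive control.

    control_band: True if the control amplicon (expected in both sexes) is present.
    sex_linked_band: True if the sex-linked amplicon is present.
    heterogametic: which sex is assumed to carry the sex-linked locus.
    """
    if heterogametic not in ("male", "female"):
        raise ValueError("heterogametic must be 'male' or 'female'")
    if not control_band:
        # Without the control band, a missing sex-linked band is uninformative.
        return "inconclusive"
    homogametic = "female" if heterogametic == "male" else "male"
    return heterogametic if sex_linked_band else homogametic


print(call_sex(control_band=True, sex_linked_band=False))   # homogametic sex
print(call_sex(control_band=False, sex_linked_band=False))  # failed reaction
```

  Without the control, the second case would be indistinguishable from the first, which is the ambiguity the reviewer describes.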
  
  I was also surprised that the authors did not conduct a GWAS analysis. That seems to be a fairly typical analysis included in studies of this type to elucidate sex-linked SNPs. It would add to an already extensive manuscript; however, this could add an additional argument for splitting this manuscript in two. It would provide more space to include it in a more focused manuscript.
  
  The results section contains many statements that would be more appropriate in the Methods section, or could be deleted entirely because they are redundant with statements already present in the Methods section. Additionally, there are some sentences that are more appropriate for inclusion in the Discussion section because they are interpretive. I have included examples under the 'Minor comments' section of this review. Some of the material presented as results in the Supplementary tables is presented in a confusing manner, and appears to contain errors (see examples in 'Minor comments' section below).
  
  The first several paragraphs of the Discussion section either repeat material already covered in the Results section, or go on tangents that are not directly related to the main purpose of the paper. However, some of it could be more appropriate to include in a genome report if the authors split the manuscript in two.
  
  Given the above issues, I find that the paper needs extensive editing and possibly more analytical work (if some of the methodological deficiencies were overlooked in the analysis phase as well as the writing phase of this project). It is unlikely this work could be accomplished in the normal window for a revision. Therefore, I regrettably suggest rejection of the manuscript.
  
  Finally, I have no meaningful experience with FISH probes or chromosomal painting so unfortunately, I can't provide much comment on those portions of the paper.
  
  Minor comments:
  Line 291: please provide specific version number for Hisat2
  Line 319: version numbers for D-Genies and SyRI missing
  Line 331: version number for NGenomeSyn missing
  Line 439-440: Authors provide N50 values, but the paper would benefit from providing some additional metrics, such as N90 and L90, to help readers gauge the contiguity of these genomes.
  Line 442 - 443: I'm having a hard time understanding how the authors are calling these 'chromosome-level' assemblies when nearly a third (>30%) of the genome of two species (C. tanaitica and C. elongatoides) could not be assembled into chromosomal scaffolds.
  Line 457 - 458: Either the term 'topologically associated domains' is missing, or the authors need to remove the parentheses from around TADs if it was defined earlier in the manuscript.
  Line 470: change 'less' to 'fewer'
  Line 483 - 486: The statements that observed patterns of repeat families 'suggest' something are interpretive and should be moved to the discussion.
  Line 499 - 500: This sentence repeats content of the methods section. I suggest deleting it.
  Line 540 - 564: If I am understanding correctly, the discussion of 'coverage' here would be more accurately described as 'depth' since the authors seem to be talking about average sequencing depth in different areas of the genome. Furthermore, authors never provide untransformed measures of sequencing depth in any context (the initial genome assemblies, pool-seq data, re-sequenced individuals, etc.). Therefore, it is difficult to determine if the differences being discussed here are derived from data with enough statistical power to measure differences in sequencing depth between male and female fish.
  Lines 614 - 619: This could be explored with GWAS
  Lines 635 - 641: Much of this paragraph is a description of methods and belongs in the Methods section.
  Lines 664 - 667: Much of this is interpretive - more appropriate for the discussion.
  Lines 700 - 711: This paragraph has little or no relevance to the main topic of this paper (hybrid meiotic failure).
  Line 745: remove "loci's"
  Line 813 - 815: PMER was already defined earlier in the paper.
  Line 854: I suggest removal of "the first of their kind in an asexually reproducing vertebrate," because such statements rarely age well, and the concept behind the paper is interesting enough to stand on its own without pointing out the novelty of it being the 'first' time it was detected.
  References section: Capitalization of article titles varies from one reference to the next. Scientific names are sometimes italicized; other times they are not.
  Table 2: 'L50' and 'Number of Chromosomes' are always going to be integers. Why are there two significant digits to the right of the decimal point?
  Supplementary Figure S2: 'Cobitis' should be italicized.
  Supplementary Table S7: This table presents pre- and post-HiC values in a confusing manner that is nonsensical and probably erroneous. For example, the N50 values seem problematic. How do you have a 154 Kbp pre-HiC N50 contig value for C. elongatoides, but a 154 Mbp post-HiC N50 contig value for the same species? This is longer than the longest reported chromosome for any species (C. taenia) in Supplementary Table S8 (99 Mbp).
  Supplementary Table S10: I don't know what the percentages in line 33 refer to?
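
  The contiguity metrics requested above (N50, N90, L90) all follow one definition, sketched here under the usual convention: sort sequences from longest to shortest and accumulate lengths until x% of the total assembly length is reached. The function name and the example lengths are hypothetical.

```python
def nx_lx(lengths, x):
    """Return (Nx, Lx) for a list of contig or scaffold lengths.

    Nx: length of the shortest sequence in the smallest set of longest
        sequences that together span at least x% of the assembly.
    Lx: number of sequences in that set (always an integer).
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    return ordered[-1], len(ordered)  # defensive fallback


# Hypothetical assembly of five sequences (total length 300).
example = [100, 80, 60, 40, 20]
print(nx_lx(example, 50))  # (80, 2): N50 = 80, L50 = 2
print(nx_lx(example, 90))  # (40, 4): N90 = 40, L90 = 4
```

  Because Lx is a count of sequences, it can only take integer values, which is the point raised about Table 2.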

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.04.01.646337v1
www.biorxiv.org www.biorxiv.org

scDenorm: a denormalisation tool for integrating single-cell transcriptomics data

3
1. GigaScience 09 Apr 2026
  
  in GigaScience
  
  Abstract: Integrating single-cell omics data at an atlas scale enhances our understanding of cell types and disease mechanisms. However, the integration of data processed by different normalisation methods can lead to biases, such as unexpected batch effects and gene expression distortion, leading to misinterpretations in downstream analysis. To address these challenges, we present scDenorm, an algorithm that reverts normalised single-cell omics data to raw counts, preserving the integrity of the original measurements and ensuring consistent data processing during integration. We evaluated scDenorm’s performance on large-scale datasets and benchmarked its impact on data integration and downstream analysis across three datasets.
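
  To make the idea of reverting normalised data to raw counts concrete, here is a minimal sketch for one cell. It is not scDenorm's implementation. It assumes the common normalisation log1p(count * scale / total) with a natural logarithm, and it assumes the cell contains at least one gene with a raw count of 1, so that the smallest non-zero back-transformed value corresponds to one count. All names are hypothetical.

```python
import math


def denormalise_cell(log_norm_values):
    """Recover integer counts from one cell's log1p-normalised values.

    Assumption: value = log1p(count * scale / total), and the smallest
    non-zero value in the cell corresponds to a raw count of 1.
    """
    unlogged = [math.expm1(value) for value in log_norm_values]
    nonzero = [value for value in unlogged if value > 0]
    if not nonzero:
        return [0] * len(unlogged)
    unit = min(nonzero)  # back-transformed value assumed to equal one count
    return [round(value / unit) for value in unlogged]


# Round trip on hypothetical raw counts, scaled to 10,000 counts per cell.
raw = [0, 1, 3, 6]
total = sum(raw)
normalised = [math.log1p(count * 1e4 / total) for count in raw]
print(denormalise_cell(normalised))  # [0, 1, 3, 6]
```

  If the smallest count in a cell is larger than 1, this simple rule returns counts divided by that minimum, which is one reason a real tool needs more careful handling than this sketch provides.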
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag032), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3:
  
  Reproducibility report for: scDenorm: a denormalisation tool for integrating single-cell transcriptomics data
  Journal: GigaScience
  ID number/DOI: GIGA-D-25-00209
  Reviewer(s): Laura Caquelin, Department of Clinical Neuroscience, Karolinska Institutet, Sweden
  
  Context
  
  This report corresponds to a second assessment of the computational reproducibility of the article GIGA-D-25-00209, following a revision by the authors after the first round of review.
  
  The scope of the computational reproducibility review is to reproduce the results in figure 5f related to the evaluation of whether scDenorm improves the biological relevance of gene expression analyses by comparing GO term enrichment from differentially expressed genes (DEGs), before and after denormalization against a gold standard.
  
  Changes since the first review
  
  The authors made several changes based on comments from the initial computational reproducibility review:
  - Reorganized and updated the code in Fig5.ipynb and R_goanalysis.ipynb,
  - Created a docker environment,
  - Provided pre-computed GO enrichment results and intermediate files in Zenodo,
  - Added an environment.yaml file for Python and an installed_packages.csv file for R,
  - Improved the Readme file.
  
  Availability of Materials
  
  a. Data
  
  - Data availability: Open
  - Data completeness: Complete = all data necessary to reproduce main results are available
  - Access Method: Repository
  - Repository: https://zenodo.org/records/17275776 (new link)
  - Data quality: Completed, no metadata was shared.
  
  b. Code
  
  - Code availability: Open
  - Programming Language(s): R and Python
  - Repository link: https://github.com/rnacentre/scDenorm_reproducibility
  - License: -
  - Repository status: Public
  - Documentation: A Readme file is provided, but some improvements are needed.
  
  Computational environment of reproduction analysis
  
  Operating system for reproduction: MacOS 15.6.1
  
  Programming Language(s): R (jupyter notebook), Python (jupyter notebook)
  
  Code implementation approach: Using shared code
  
  Version environment for reproduction: Docker version 28.5.1, R version 4.5.1 (2025-06-13), Python 3.13.9
  
  Results
  
  5.1 Original study results
  
  - Results 1: In revised version 1 of the paper, Figure 5 does not appear in the PDF. Therefore, we assumed that the figure is identical to the one in the original submission, especially based on the authors' comment stating that "We re-ran the analysis and obtained results consistent with those reported in the manuscript." Below is Figure 5f from the original paper:
  
  (See screenshot)
  
  The intermediate file "PBMC_go_analysis_result.csv" shared in Zenodo was used to run the authors' code and extract the numerical values of this graph, enabling direct comparison:
  
  (See screenshot)
  
  5.2 Steps for reproduction
  
  -> Follow the readme guidelines to set up the environment:
  --> Download the notebooks from GitHub. Note: the notebook list in the readme is not up to date.
  --> Install Docker and Jupyter. Note: the Jupyter installation is not specified in the readme file.
  --> Download data.
  --- Issue 1: No link for downloading the data was provided in the readme file in the GitHub repository. The Zenodo link in the manuscript was not updated in the "Availability of Data and Materials" section.
  ---- Resolved: The new link was provided in the authors' response to the reviewer but needs to be added to the manuscript and the readme file. The link is https://zenodo.org/records/17275776.
  --- Issue 2: Guidelines in the README file do not correspond to the actual procedure.
  ---- Resolved: From the Zenodo archive, download scDenorm_reproducibility.tar.gz, unzip it, and place the data into the data folder. It would be clearer if the authors explicitly specified which files should be placed in the data directory to avoid confusion.
  --> Run the docker image.
  --- Issue 3: The following Docker instructions provided by the authors do not work as written:
  
      tar -xzf scdenorm_v0.tar.gz
      docker load -i scdenorm_v0.tar
      docker run -p 8888:8888 -v /path/to/scDenorm_reproducibility:/app scdenorm_v0 \
        jupyter lab --ip=0.0.0.0 --no-browser --allow-root
  
  scdenorm_v0.tar.gz does not contain a standard Docker .tar image. After extraction, the result is a directory named scdenorm_v0, not a .tar file. docker load -i scdenorm_v0.tar fails because scdenorm_v0.tar does not exist. Docker must be running before executing docker load. The extraction step is sensitive to the current directory, but this is not documented.
  ---- Resolved: The image can be successfully loaded directly from the .tar.gz file using:
  
      docker load < scdenorm_v0.tar.gz
  
  After this, the image scdenorm_v0:latest is available.
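
  Confusion of the kind described in Issue 3 can be reduced by inspecting an archive before extracting or loading it. The sketch below lists the top-level entries of a tar archive with Python's standard tarfile module. It relies on the assumption that archives written by docker save carry a manifest.json at their root; the function name and the returned keys are hypothetical.

```python
import tarfile


def describe_archive(path):
    """List top-level entries of a (possibly gzipped) tar archive.

    Assumption: an archive produced by `docker save` has manifest.json at
    its root, so its presence is used as a rough indicator only.
    """
    with tarfile.open(path, mode="r:*") as archive:  # "r:*" detects compression
        names = archive.getnames()
    top_level = set()
    for name in names:
        parts = [part for part in name.split("/") if part not in ("", ".")]
        if parts:
            top_level.add(parts[0])
    return {
        "top_level": sorted(top_level),
        "looks_like_docker_image": "manifest.json" in top_level,
    }
```

  Called on the downloaded file, for example describe_archive("scdenorm_v0.tar.gz"), this would show whether the archive should be passed to docker load as is or unpacked first.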
  
  --- Issue 4: Two main issues appeared when running the docker run command:
  ----- "WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8)"
  ----- "mounts denied: The path /path/to/scDenorm_reproducibility is not shared from the host".
  ---- Resolved: To be able to use the docker run command, two steps were needed:
  ----- Share the project folder with Docker manually: Docker → Preferences → Resources → File Sharing → add the local project path
  ----- Update the docker run command with the local path and add --platform linux/amd64:
  
      docker run --platform linux/amd64 \
        -p 8888:8888 \
        -v /path/to/scDenorm_reproducibility:/app \
        scdenorm_v0 \
        jupyter lab --ip=0.0.0.0 --no-browser --allow-root
  
  --- Issue 5: R was not connected to Jupyter.
  ---- Resolved: Running the following in the terminal made the R kernel available:
  
      R
      install.packages("IRkernel")
      IRkernel::installspec()
  
  -> Run the Fig5_R__goanalysis.ipynb script
  --- Issue 6: The Docker image does not install the R packages. The file installed_packages.csv lists all required R packages, but they are not installed automatically.
  ---- Resolved: A solution was to install all required packages at the start of the notebook using the csv file:
  
      pkg_list <- read.csv("installed_packages.csv", stringsAsFactors = FALSE)
  
      # Install every listed package that is not already available
      for (pkg in pkg_list$Package) {
        if (!requireNamespace(pkg, quietly = TRUE)) {
          message(" Installing the package: ", pkg)
          tryCatch(
            install.packages(pkg, dependencies = TRUE),
            error = function(e) message("Failed to install package: ", pkg)
          )
        } else {
          message(" Already installed: ", pkg)
        }
      }
  
  Additional required packages from Bioconductor:
  
      if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
      # requireNamespace() checks one package at a time, so loop over the names
      for (pkg in c("enrichplot", "clusterProfiler", "org.Hs.eg.db")) {
        if (!requireNamespace(pkg, quietly = TRUE)) BiocManager::install(pkg, ask = FALSE)
      }
  
  After these steps, the R script ran without errors.
  
  -> Run the Fig5.ipynb script
  --- Issue 7: The same issue as no. 3 occurred again: the Docker image did not provide a working Python environment. An attempt to create the Python environment from the environment.yaml file,
  
      conda env create -f environment.yaml
  
  failed because many packages do not exist for the system, for example: "ipyw_jlab_nb_ext_conf ==0.1.0 py39h06a4308_1 does not exist (perhaps a typo or a missing channel);" These errors seem to happen because the environment file contains many Linux-specific packages.
  ---- Unresolved: Authors should provide an environment file that works on all systems. A temporary solution was used: create a minimal clean environment:
  
      conda env create -f environment.yaml
  
  environment.yaml:
  
      name: scdenorm_clean
      channels:
        - conda-forge
        - bioconda
        - defaults
      dependencies:
        - python=3.9
        - numpy
        - pandas
        - scipy
        - matplotlib
        - seaborn
        - tqdm
        - scanpy
        - anndata
        - tables
        - pip
        - pip:
            - scdenorm
            - SCCAF
  
  Then:
  
      conda activate scdenorm_clean
      conda install ipykernel
      python -m ipykernel install --user --name=scdenorm_clean --display-name "Python (scdenorm)"
  
  Select this kernel in Jupyter Notebook to run the python files.
  
  An additional issue was the conflict between matplotlib and scanpy. Resolved with:
  
      conda install matplotlib=3.6.3
      conda install -c conda-forge scanpy
  
  (Successfully installed scanpy-1.10.3)
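
  The failure in Issue 7 involves a dependency pinned to a platform-specific build string (py39h06a4308_1). One common mitigation is to drop build strings from an exported environment, which is what conda env export --no-builds does at export time. The sketch below applies the same idea to individual dependency specs; the helper name is hypothetical, and removing build strings does not help for packages that exist only on one platform.

```python
def strip_build(spec):
    """Turn 'name=version=build' into 'name=version'; leave other specs as is."""
    cleaned = spec.strip()
    parts = cleaned.split("=")
    # Exported conda specs have exactly three non-empty '='-separated fields.
    if len(parts) == 3 and all(parts):
        return "=".join(parts[:2])
    return cleaned


specs = [
    "python=3.9",
    "ipyw_jlab_nb_ext_conf=0.1.0=py39h06a4308_1",
]
print([strip_build(spec) for spec in specs])
```

  Specs without a build string, such as python=3.9, pass through unchanged.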
  
  --> The script was executed starting from the HSPC section only.
  --- Issue 8: After filtering the dataframe tmp1 by go_terms, only two cell types remained (b0 and b1), and b1n disappeared. This was because no row corresponding to b1n matched the selected GO terms.
  ---- Unresolved: Fig5_R__goanalysis.ipynb was re-run multiple times to obtain a new version of PBMC_go_analysis_result.csv. However, the error persists.
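
  The behaviour in Issue 8, where a group vanishes after filtering, can be detected before plotting. The sketch below uses plain dictionaries instead of the notebook's dataframe; the column names (cell_type, go_term), the GO identifiers, and the rows are hypothetical placeholders, with only the labels b0, b1, and b1n taken from the report.

```python
def missing_groups(rows, go_terms, group_key="cell_type", term_key="go_term"):
    """Return the groups that have no row left after filtering by go_terms."""
    all_groups = {row[group_key] for row in rows}
    kept_groups = {row[group_key] for row in rows if row[term_key] in go_terms}
    return sorted(all_groups - kept_groups)


# Hypothetical enrichment table: b1n has no row for the selected term.
rows = [
    {"cell_type": "b0", "go_term": "GO:A"},
    {"cell_type": "b1", "go_term": "GO:A"},
    {"cell_type": "b1n", "go_term": "GO:B"},
]
print(missing_groups(rows, {"GO:A"}))  # ['b1n']
```

  Reporting the dropped groups explicitly makes it clear whether a category is absent from a figure because of the filter or because of an upstream error.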
  
  5.3 Statistical comparison Original vs Reproduced results
  
  - Reproduced results: Figure 5f
  
  (see screenshots)
  
  Comments: The figure obtained does not show all go_terms nor all categories. Only categories b1 and b0 are shown.
  
  Errors detected: -
  
  Statistical Consistency: If there is no error, b0 would correspond to the gold standard and b1 to the before_scDenorm cell type. The -log10(adjusted p-value) values reproduced do not match the reported values.
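
  A comparison of reported and reproduced -log10(adjusted p-value) values can be made explicit with a small tolerance check. This is a sketch under stated assumptions: the term names, the p-values, and the tolerance are hypothetical and are not the values from Figure 5f.

```python
import math


def mismatched_terms(reported, reproduced, rel_tol=1e-6):
    """Return GO terms whose -log10(adjusted p) differ or that exist on one side only.

    reported, reproduced: dicts mapping a GO term to its adjusted p-value.
    """
    mismatches = []
    for term in sorted(set(reported) | set(reproduced)):
        if term not in reported or term not in reproduced:
            mismatches.append(term)
            continue
        a = -math.log10(reported[term])
        b = -math.log10(reproduced[term])
        if not math.isclose(a, b, rel_tol=rel_tol):
            mismatches.append(term)
    return mismatches


reported = {"term_a": 1e-5, "term_b": 0.01}
reproduced = {"term_a": 1e-5, "term_b": 0.02, "term_c": 0.05}
print(mismatched_terms(reported, reproduced))  # ['term_b', 'term_c']
```

  Listing terms that are missing on one side separately from terms whose values differ helps separate filtering problems from numerical discrepancies.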
  
  Conclusion
  
  Follow-up on previous recommendations: In the first round of review, we noted the following points:
  -- Add a requirement file that lists all the needed packages with their exact versions. The authors provided an installed_packages.csv, which made it possible to manually reconstruct the R environment. However, a functional environment.yaml is required.
  -- Make sure all data files needed to reproduce the figures are available in the repository. The authors updated the Zenodo link and uploaded all relevant intermediate files.
  -- Clearly explain which parts of the results may vary due to randomness in the model and how much variation users should expect. This point remains insufficiently addressed.
  
  Summary of the second computational reproducibility review
  
  Both scripts used to reproduce Figure 5f were executed, but several issues were encountered. The results obtained differ from the ones reported in the manuscript. In particular:
  -- Several p-values could not be reproduced,
  -- Some discrepancies appeared in the GO enrichment analysis. Some clarifications are required for the GO analysis about why some cell types are not present after filtering.
  
  Significant manual intervention was required. To improve reproducibility, here are some new recommendations:
  -- Improve the readme file. The readme does not reflect the real procedure needed to reproduce the results (incorrect docker instructions, missing steps, outdated notebook list). Clear instructions should be added regarding:
  --- the required Jupyter installation,
  --- file paths and folder structure,
  --- the link to the Zenodo archive,
  --- how to run each notebook.
  -- Provide a functional environment.yaml. The provided docker image fails to create the required Python and R environments.
2. GigaScience 09 Apr 2026
  
  in GigaScience
  
  Reviewer 1:
  
  The authors have addressed all of my comments. The manuscript is suitable for publication after formatting in accordance with the journal's regulations.
3. GigaScience 09 Apr 2026
  
  in GigaScience
  
  Reviewer 2:
  
  The authors have done a commendable job in addressing the concerns and making the tool accessible. The manuscript is now improved and I recommend it for publication.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.05.10.653289v1
Mar 2026
www.biorxiv.org www.biorxiv.org

Inference of admixture in dogs from whole genome sequences

2
1. GigaScience 25 Mar 2026
  
  in GigaByte
  
  Editors Assessment:
  
  In this new methodological work, researchers investigate the genetic structure and admixture patterns among dog breeds through a comprehensive analysis of whole genome sequencing data. A reference population was established comprising 349 individuals across 65 breeds, from which breed-informative single nucleotide polymorphisms (SNPs) were derived. The SCOPE algorithm, previously employed in many global ancestry studies, was used to estimate admixture proportions and demonstrated strong accuracy even at low sequencing depths (<1x). After peer review suggested changes to data processing, the work was suitably solid to make some interesting findings using this approach. Results indicate that specific breeds, such as Catahoula Leopard Dogs and Greek Tracers, present unique challenges in admixture inference due to their genetic proximity to other breeds. There were also challenges in estimating Pit Bull Terrier ancestry/admixture, suggesting that there could be several genotypes associated with the Pit Bull Terrier breed. The methods provide a robust framework for future assessments of canine genetic diversity and health implications in canid populations. Processed reference population data is also available in the GitHub repository for reuse.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 25 Mar 2026
  
  in GigaByte
  
  Abstract: Background: Understanding the genetic architecture of domestic dogs provides unique insights into the processes of domestication, breed formation, and the genetic basis of complex traits and diseases. Dog populations, characterized by their diverse morphologies and behaviors, also exhibit extensive evidence of historical and ongoing admixture. This widespread mixing, driven by both natural migration and selective breeding practices, has profoundly shaped the genomic landscape of modern dog breeds. Though global admixture has been extensively estimated in human population studies, where the number of subgroups is typically limited, there has been more limited analysis in canines, where there may be dozens of ancestral groups, or breeds. Results: Here we present a procedure for estimating global admixture in dogs from whole genome sequence data using SCOPE. We created a reference population of 65 dog breeds that included 349 individuals, from which we determined breed-informative SNPs. We demonstrate that SCOPE can accurately infer breed composition in both simulated and real admixed samples, even at low sequencing depths. We also characterized the genetic similarity between our reference dog breeds and recovered previously reported relationships. Conclusion: This approach allows us to identify the strength of the genetic signature of breeds and place error bounds on admixture estimates. It also provides evidence that admixture can be accurately inferred in subjects that may originate from multiple ancestral populations. Competing Interest Statement: Matteo Pellegrini is affiliated with ProsperK9, which developed a direct to consumer test for dog ancestry.
  
  This paper is now published in GigaByte, with the paper and peer reviews shared under a CC-BY license:
  
  https://doi.org/10.46471/gigabyte.173
  
  Reviewer 1. Professor Tracy Smith
  
  Is the code executable? Unable to test.
  
  Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?
  
  No. I would like to see full plink code too as well as R dependencies.
  
  See review here: https://gigabyte-review.rivervalleytechnologies.com/api/download-documents?payload=dLQpgSxI41Ksf6QFmEn6UrYcoBRhxttwE1cPXu8tMOJByVSthbzG5HM9e=CR3Vljzdt8InzTYtFxfQeD7116f6ET5X03k=la7k1ex9dyGXUGWQUtgC1E7llsk2kR3mX6mYDp
  
  Reviewer 2. Tatiana Feuerborn
  
  The premise of the paper could be an interesting test of the use of the SCOPE software on dogs. I can appreciate the idea of the manuscript and the profile of the journal selected for the submission, but even so, the journal's intentions do not appear to align fully with the way the method was tested. Additionally, the dataset of dog breeds is insufficient to be informative. This is particularly true of the number of mixed dogs tested and the down-sampling. Furthermore, the interpretation of the results does not acknowledge these limitations or the nuance of the geographical bias of the dataset.
  
  Reviewer comments:
  
  “Bergstrom et al. 2012”: wrong citation.
  
  “Global ancestry, inferred by tools such as SCOPE and ADMIXTURE (Alexander et al. 2009), attempts to infer the proportions of an individual’s genome that belong to an ancestral breed or group.” Citation for SCOPE missing.
  
  If studies such as Parker et al 2017 have used 160 breeds and the authors have mentioned the numerous subpopulations of dogs, why did the authors choose to use such a small number of breeds for their study?
  
  Why were the top 2500 SNPs used? Why not 1000 or 10000, etc? Testing the number that are needed would be very informative.
  
  Figure 1: I would recommend sorting the breeds by value so that the results can be interpreted more easily.
  
  “We also note that certain groups of breeds tend to group together. For example Samoyed, Basenji, and Husky samples are found near each other on the UMAP. This group has been shown to represent ancient breeds (Larson et al. 2012, Pickrell and Pritchard 2012, Wojcik and Powierza 2021).” There are other explanations for this pattern: almost all of the other breeds examined are of European origin, and there is very little representation of non-European ancestry within the small sample of dog breeds included in the study.
  
  “Despite this the more distant relationships between breeds differ from some of the previous studies, as these may be more difficult to define using our markers.” If this is the case and could be influencing results, it would be relevant to mention which breeds these are.
  
  Why is SNP chip data rather than whole genome sequencing being used for the study? This should be clearly established, any explanation for this choice is completely absent. Is it because SCOPE can only handle a small number of sites? Is it because of the availability of the dataset? If so, note my previous concern with the small sample size, despite the public availability of much larger datasets. Is there another reason?
  
  Figure 5: The legend is very large relative to the figure itself, and I would recommend a clearer delineation of the breeds; it is unclear where one breed ends and the next begins. At the current figure size it is also difficult to see the individuals with more than one bar colour. A continuous colour scheme is probably the wrong choice for the plot, as it makes the already difficult delineation of the breeds nearly impossible. Given that the breeds are sorted alphabetically, the colour assignment is purely incidental, which makes the continuous palette even more inappropriate.
  
  Figure 7: Most of my comments on Figure 5 also apply to Figure 7. The colour choices make it very difficult to see how many segments are present in each bar. An indicator of which simulated individuals were determined to be successful versus unsuccessful would also be helpful. For example, the rightmost four bars look fairly unsuccessful to me, as they are all missing a component in the estimate that was present in the truth.

  Using three mixed dogs seems like a very small number of samples to test the accuracy of the tool on real datasets. Downsampling only one individual to test the impact of coverage is likely not representative of the impact of low coverage across all breed compositions. Downsampling a larger number of individuals would be more informative about the accuracy of the results.

  In the discussion on page 11, many old citations are used to back up the interpretation of the close relationship of Siberian Huskies, Basenjis, and other non-European breeds. No mention is made of the geographical bias of the dataset as a possible reason for this, as I previously mentioned.
  
  General comments: There are frequent issues with a lack of spaces ‘ ‘ between parentheses and neighbouring words and punctuation. A different colour palette should be used for the figures; the poor colour choice used throughout the manuscript makes it very difficult to determine the breed. Inconsistent citation styles are used, e.g. “Similarly, prior maximum likelihood estimation based techniques have suggested that Huskies and Samoyeds are both ancient breeds and related to Basenji (23, 29).”

Annotators

GigaScience

URL

biorxiv.org/content/10.64898/2026.02.09.704954v1
www.biorxiv.org www.biorxiv.org

MIReVTD, a Minimum Information Standard for Reporting Vector Trait Data

2
1. GigaScience 20 Mar 2026
  
  in GigaScience
  
  Abstract: Vector-borne diseases pose a persistent and increasing challenge to human, animal, and agricultural systems globally. Mathematical modeling frameworks incorporating vector trait responses are powerful tools to assess risk and predict vector-borne disease impacts. Developing these frameworks and the reliability of their predictions hinge on the availability of experimentally derived vector trait data for model parameterization and inference of the biological mechanisms underpinning transmission. Trait experiments have generated data for many known and potential vector species, but the terminology used across studies is inconsistent, and accompanying publications may share data with insufficient detail for reuse or synthesis. The lack of data standardization can lead to information loss and prohibits analytical comprehensiveness. Here, we present MIReVTD, a Minimum Information standard for Reporting Vector Trait Data. Our reporting checklist balances completeness and labor-intensiveness with the goal of making these important experimental data easier to find and reuse, without onerous effort for scientists generating the data. To illustrate the standard, we provide an example reproducing results from an Aedes aegypti mosquito study.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag020), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2:
  
  I read the manuscript with interest, as I wholeheartedly agree there is a strong need for harmonization in reporting quantitative measurements of vector traits, especially for the subsequent development of mathematical models. The paper is well written, and the examples are very helpful, particularly the one shown in Figure 1, which advocates for sharing individual (possibly raw) observations. I have some very minor comments and suggestions.

  Given the broad readership of the journal, I feel the Introduction would benefit from definitions of what the authors mean by vector and vector-borne diseases, with some examples (WNV, DENV, … up to you).

  It is not very clear to me how the authors' current proposal aligns with what was already proposed in Wu et al. 2022 (ref 21). It seems like some sort of extension? Could you please elaborate on this?

  Regarding latitude and longitude, I think the coordinate reference system should also be standardized (WGS, not UTM or others).

  You might provide some examples of online repositories (line 187). Some (like GitHub) might not be perpetually available, unlike (hopefully) others such as Zenodo or the Supplementary Materials accompanying the paper. The latter might be preferable in my opinion.

  Figure 1: Please provide the equation of the TPC.

  Please note that Figure 2 does not currently seem to be cited in the main text (perhaps it should be on line 248?). What does "Dataset: 572" mean?

  As VecTraits currently seems to be the best (and only?) example of what the authors are proposing, perhaps it should be mentioned in the Abstract as well.
2. GigaScience 20 Mar 2026
  
  in GigaScience
  
  Reviewer 1:
  
  The authors propose MIReVTD, a concise minimum-information checklist for reporting vector trait data, motivated by the lack of consistent terminology and metadata that impedes reuse and synthesis across studies. The scope and intent are clearly stated in the Abstract and Introduction, including the emphasis on FAIR principles and the illustrative Aedes aegypti example and VecTraits implementation. Overall, this is a timely, valuable contribution that complements MIReAD (arthropod abundance) and the vector competence minimum data standard, and it will be highly useful to both experimentalists and modellers.
  
  Major
  
  It would be highly beneficial to demonstrate the compatibility and added value of MIReVTD and the VecTraits database relative to existing initiatives that aim to collect and structure similar information. The authors mention ETS, MIAPPE, and MIReAD, but an explicit mapping of how the minimum information fields align would help to place MIReVTD in context and facilitate adoption of this standard.
  
  The "Axes of Variation" section is strong, but it could be clearer about what constitutes a stressor or condition. It would help to list common confounders such as humidity, photoperiod, diet or food ration and quality, larval density, and light cycle, and to encourage recording fixed or background conditions in separate fields rather than only gradients. This would help avoid ambiguity between variables that are experimentally varied and those that simply describe the environment. In Figure 2, the second stressor appears to take a fixed value (0.1). This is somewhat confusing because it is not clear whether this field is meant for another gradient (e.g., temperature in the range of 20 to 40 °C in addition to food ration categories), or whether it lists fixed conditions under which the experiment was performed. If it is the latter, it might be more practical to include additional fields for stressors so that all relevant conditions, such as humidity and photoperiod, can be recorded. It would also help to clarify whether a third or further stressor can be added to the table, and how these would appear. It might in fact be preferable not to distinguish gradients from fixed conditions at all, and instead to treat them uniformly as conditions, each defined with its corresponding unit and uncertainty. This would simplify the structure and prevent confusion about whether a variable was held constant or systematically varied.
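The uniform treatment suggested at the end of this comment could be sketched as follows. This is a minimal illustration in Python, not part of MIReVTD or VecTraits: the `Condition` class, its field names, the `varied_conditions` helper, and the unit strings are all hypothetical. Only the fixed value 0.1 and the 20 to 40 °C range echo the comment above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Condition:
    """One experimental condition, varied or held constant (hypothetical schema)."""
    name: str                            # e.g. "temperature", "food_ration"
    value: float
    unit: str                            # illustrative unit string, not a MIReVTD vocabulary
    uncertainty: Optional[float] = None  # same unit as value; None if not reported


def varied_conditions(observations):
    """Return the names of conditions that take more than one value across observations."""
    seen = {}
    for conditions in observations:
        for c in conditions:
            seen.setdefault(c.name, set()).add(c.value)
    return sorted(name for name, values in seen.items() if len(values) > 1)


# Two observations: temperature is a gradient, food ration is fixed at 0.1.
observations = [
    [Condition("temperature", 20.0, "degC", 0.5), Condition("food_ration", 0.1, "hypothetical_unit")],
    [Condition("temperature", 40.0, "degC", 0.5), Condition("food_ration", 0.1, "hypothetical_unit")],
]
print(varied_conditions(observations))
```

With this layout, whether a variable was held constant or systematically varied is derived from the data rather than encoded in separate field types, and further conditions (humidity, photoperiod) are simply additional entries.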
  
  It would greatly improve usability and adoption if the standard also recommended ORCIDs for contributors, DOIs for datasets, and an explicit data license (e.g., CC BY/CC0). If this extension is possible, I recommend that the authors add a short "Data licensing & citation" paragraph to the Results section.
  
  Minor

  - Line 248: Fig. 2?
  - Please update the citation of bayesTPC.
  - If possible, please provide a code snippet with the data used (in Zenodo or as Supplementary Material) for Fig. 1.
  - I believe the following are also relevant to this study and should be mentioned appropriately:
    - Adams B, Franz N, König-Ries B, et al. TraitBank: Practical semantics for organism attribute data. Semantic Web. 2015;7(6):577-588. doi:10.3233/SW-150190
    - Kattge, J., Ogle, K., Bönisch, G., Díaz, S., Lavorel, S., Madin, J., Nadrowski, K., Nöllert, S., Sartor, K. and Wirth, C. (2011), A generic structure for plant trait databases. Methods in Ecology and Evolution, 2: 202-213. https://doi.org/10.1111/j.2041-210X.2010.00067.x

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.01.27.634769v1
www.biorxiv.org www.biorxiv.org

Comparative analysis of eccDNA and circRNA tools shows increased accuracy of tool combination

2
1. GigaScience 20 Mar 2026
  
  in GigaScience
  
  Abstract

  Introduction: Circular nucleic acids such as extrachromosomal circular DNA (eccDNA) and circular RNA (circRNA) are increasingly recognized for their biological relevance and potential as biomarkers in disease contexts. Despite their growing importance, their detection remains challenging due to tool-specific biases, limited validation frameworks, and high variability in performance across datasets.

  Methods: We benchmarked 10 circle detection tools across diverse conditions using both simulated and biological datasets. Our evaluation included classical performance metrics and a novel internal measure of read distribution symmetry (ΔCJ) to assess circle prediction confidence. We explored the impact of sequencing protocols, filtering strategies, and combined tool consensus.

  Results: We found that detection accuracy was highly influenced by sequencing depth, alignment algorithm, and experimental enrichment protocols. ΔCJ proved effective in flagging potential false positive circles, showing improved accuracy of Intersect (circles detected by all tools) and Rosette (circles detected by ≥ 2 tools) combinations.

  Discussion: This study offers a broad evaluation of circular detection tools, suggesting that the combination of ≥3 tools is necessary for a correct prediction. These insights will inform future experimental design and data analysis pipelines in both experimental and clinical settings.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag017), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2:
  
  This manuscript presents a systematic and carefully executed benchmark of eccDNA and circRNA detection tools using both in silico simulations and biological datasets. The introduction of CircleSim and, in particular, the ΔCJ metric as a proxy for detection quality in the absence of ground truth is a notable conceptual contribution. The study is generally well designed, the analyses are extensive, and the conclusions are largely supported by the data. However, some points need to be addressed to strengthen the manuscript and avoid potential misinterpretation of the results.

  1. Provide a concise table summarizing tool versions, aligners, key parameters, etc. This would be helpful for readers attempting to replicate the benchmark.

  2. CircleSim is a useful contribution, but its biological realism requires clearer justification. Circles are generated uniformly across chromosomes and transcripts, yet real eccDNA and circRNA formation is known to be biased by chromatin state, transcriptional activity, repetitive elements, and genomic architecture. The authors should explicitly discuss which biological biases are not captured by CircleSim, and explain how this affects interpretation of precision/recall values.

  3. The conclusion that higher sequencing coverage increases false positives is intriguing but potentially misleading if generalized. The observed decrease in F-score at high coverage appears to be driven by the accumulation of low-confidence split reads and by tool-specific sensitivity to noise. The manuscript should clarify that high coverage per se is not intrinsically detrimental, but rather that current algorithms lack sufficient FP control at high depth without stricter filtering. Reframing this as a tool- and filter-dependent phenomenon would prevent misinterpretation.
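To make the precision/recall and tool-combination terms in this review concrete, here is a minimal Python sketch. It is not the manuscript's pipeline: it assumes circles are matched by exact `(chrom, start, end)` coordinates (real benchmarks typically allow some coordinate tolerance), and the tool names, coordinates, and function names are invented for illustration.

```python
from collections import Counter


def consensus(calls, min_tools):
    """Circles reported by at least `min_tools` tools.

    `calls` maps a tool name to a set of (chrom, start, end) tuples.
    """
    counts = Counter(circle for circles in calls.values() for circle in circles)
    return {circle for circle, n in counts.items() if n >= min_tools}


def precision_recall_f1(predicted, truth):
    """Classical metrics for a predicted set against a ground-truth set."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented toy data: two true circles, three hypothetical tools.
truth = {("chr1", 100, 500), ("chr2", 50, 900)}
calls = {
    "tool_a": {("chr1", 100, 500), ("chr2", 50, 900), ("chr3", 10, 40)},
    "tool_b": {("chr1", 100, 500), ("chr3", 10, 40), ("chr4", 5, 70)},
    "tool_c": {("chr1", 100, 500), ("chr2", 50, 900)},
}

at_least_two = consensus(calls, min_tools=2)         # circles found by >= 2 tools
all_tools = consensus(calls, min_tools=len(calls))   # circles found by every tool

print(precision_recall_f1(at_least_two, truth))
print(precision_recall_f1(all_tools, truth))
```

In this toy case the stricter combination trades recall for precision, which is the kind of tool- and filter-dependent behaviour point 3 asks the authors to spell out.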
2. GigaScience 20 Mar 2026
  
  in GigaScience
  
  Reviewer 1:
  
  The authors have adequately addressed all of my comments and concerns, and while there are future directions that should be explored (e.g., the effect of library prep on eccDNA detection, the effect of sequencing artifacts on eccDNA detection), I agree with the authors that those tasks are slightly outside the scope of their existing manuscript. Lines 105-106 have a minor grammatical error. I have no further suggestions and recommend the manuscript for publication, as the comparisons performed here will help people in the field understand which tools should be used.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.07.14.664708v1
Feb 2026
www.biorxiv.org www.biorxiv.org

Enhanced semantic classification of microbiome sample origins using Large Language Models (LLMs)

3
1. GigaScience 24 Feb 2026
  
  in GigaScience
  
  Abstract: Over the past decade, central sequence repositories have expanded significantly in size. This vast accumulation of data holds value and enables further studies, provided that the data entries are well annotated. However, the submitter-provided metadata of sequencing records can be of heterogeneous quality, presenting significant challenges for re-use. Here, we test to what extent large language models (LLMs) can be used to cost-effectively automate the re-annotation of sequencing records against a simplified classification scheme of broad ecological environments with relevance to microbiome studies, without retraining. We focused on sequencing samples taken from the environment, for which metadata is important. We employed OpenAI Generative Pretrained Transformer (GPT) models, and assessed scalability, time and cost-effectiveness, as well as performance against a diverse, hand-curated ground-truth benchmark with 1000 examples, that span a wide range of complexity in metadata interpretation. We observed that annotation performance markedly outperforms that of a baseline, manually curated, non-machine-learning keyword-based approach. Changing models (or model parameters) has only minor effects on performance, but prompts need to be carefully designed to match the task. We applied the optimized pipeline to more than 3.8 million sequencing records from the environment, providing coarse-grained yet standardized sampling site annotations covering the globe. Our work demonstrates the effective use of LLMs to simplify and standardize annotation from complex biological metadata.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag015), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3:
  
  The ability to reuse scientific data for secondary analysis is an extremely important topic. Since the promotion of the FAIR Guiding Principles a decade ago, the central importance of standards-adherent metadata has received considerable attention. Although this paper surprisingly doesn't mention the FAIR principles, the work is important in understanding what it takes to make datasets FAIR and "AI ready."
  
  A core problem with the paper is that it is unsure who its audience is. The paper is motivated by the needs of scientists to search for and reuse online datasets for secondary analysis, but much of the paper concerns highly technical issues that are related to fine-tuning LLM performance. It is laudable that the manuscript annotates its discussion of the authors' methods with pointers to actual Python scripts that would allow third parties to replicate the authors' work. The detailed presentation, however, may make it hard for many readers to understand the computational strategy that all the scripts are implementing. The organization of the paper weaves from discussions of the ability of LLMs to extract scientific standards from "legacy" experimental metadata to details of how to enhance computational efficiency to make the use of LLMs more cost-effective. The title and abstract of the paper suggest that the authors are aiming for a more scientific audience, but much of the manuscript focuses on arcane implementation details that will be less important to such readers.
  
  Missing from the paper is a detailed discussion of what the metadata in SRA are really like. The reader never sees complete examples of the metadata that are processed in the authors' work, and thus it is hard to have intuition about the problem that the authors are trying to solve. In particular, the paper doesn't present information about the range of attributes in user-defined metadata fields in SRA. The paper would benefit from a discussion of the structure of scientific metadata in general, and of how the authors' work fits into the larger effort in the research community to make datasets FAIR. (Full disclosure: My own laboratory is involved in such activity. See https://arxiv.org/abs/2504.05307v2)
  
  The abstract of the paper states that the authors "test to what extent LLMs can be used to cost-effectively automate the re-annotation of sequencing records." Alas, the paper really examines re-annotation of only the fields for "biome" and "location." A weakness of the paper is that the reader doesn't learn what other fields may be relevant in these metadata records, and why the authors chose to focus on the particular fields that they studied. Overall, much more attention should be placed on discussion of the limitations of the work and how well the results might scale to more general problems in standardization of scientific metadata.
  
  Minor comments:
  
  The term "biome" is never well defined.
  
  Frequently, parenthetical remarks begin with "e.g." and end with "etc." This style is redundant; you need only one of these abbreviations in each instance.
  
  Page 6, para 1: "last three digits" or "last three characters"? What is the motivation for consolidating reference ontologies into a single dictionary?
  
  Page 13, para 2: The notion of "lenient matches" requires much more discussion. If the goal is to make the legacy metadata standards-adherent, then a "lenient" match would not seem to be valid. The operative question is, "What metadata terms will users invoke to search for datasets?", and presumably users will be searching for standard terms only.
  
  Page 17, para 3: It's unclear what is meant by "most samples have fewer misclassifications." Fewer than what?
  
  Page 24, para 2: It's not clear what you mean when you say, "In half of the cases GPT correctly predicted the location, while the lat/lon coordinates parsed from the metadata were incorrect." Are you saying that GPT gave correct results when the lat/lon data were incorrect?
  
  Figure 1 is very busy and the tiny font is hard to read.
2. GigaScience 24 Feb 2026
  
  in GigaScience
  
  Reviewer 2:
  
  Reproducibility report for: Enhanced semantic classification of microbiome sample origins using Large Language Models

  Journal: GigaScience

  ID number/DOI: GIGA-D-25-00316

  Reviewer(s): Laura Caquelin, Department of Clinical Neuroscience, Karolinska Institutet, Sweden [Wrote the report and reproduced the results]
  
  Summary of the Study
  
  This study evaluates whether Large Language Models (LLMs) can help re-annotate sequencing records. Using GPT models, the authors tested scalability, time, cost, and performance against a benchmark of 1,000 hand-curated examples. They then applied this approach to more than 3.8 million environmental sequencing records, producing standardized annotations.
  
  Scope of reproducibility
  
  According to our assessment, the primary objective is to evaluate how closely GPT's annotation performance approached that of a human expert when classifying environmental sequencing samples into biomes and sub-biomes.
  
  Outcome: Accuracy of biome and sub-biome classification compared against a hand-curated benchmark dataset.
  
  Analysis method outcome: As described to validate biome classifications: "For paired comparisons of repeated sample IDs, we use the McNemar test, which is appropriate for paired binary outcomes (True/False)", while "for comparisons across different sample sets, we employ the t-test for independent samples. In both scenarios, a Bonferroni correction is applied to adjust for multiple comparisons".
  
  For sub-biomes, comparisons across different sets were performed with independent t-tests, while "for runs involving the same sample IDs, comparisons are performed using the paired t-test". Section "Validation statistics" page 13-14.
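For readers unfamiliar with the tests quoted above, the following Python sketch shows an exact (binomial) McNemar test on paired True/False outcomes followed by a Bonferroni adjustment. It is an illustration under stated assumptions, not the authors' implementation: the manuscript excerpt does not say which McNemar variant or library was used, and the function names and toy data are invented.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired True/False outcomes."""
    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    # Binomial(n, 0.5) probability of a split at least as extreme as min(b, c).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Invented toy data: 10 samples classified by two annotators.
run_a = [True] * 8 + [False] * 2
run_b = [False] * 8 + [True] * 2
p = mcnemar_exact(run_a, run_b)
print(p, bonferroni([p, 0.5]))
```

The adjustment is why a raw p-value below 0.05 can still end up non-significant once many configuration pairs are compared, as in the corrected result discussed later in this report.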
  
  Main result: "The improvement in accuracy between GPT's initial classification and the human's performance with the improved prompt was statistically significant (adj p-value=0.031), but also between the human's attempt first attempt (with the initial prompt) and the human's second attempt (better prompt) (adj p-value≤0.001). No significantly different performances were detected for the sub-biome classification, neither between GPT and the human, nor between prompt versions (adjp-value=1)." Section "Human versus GPT classification accuracy" pages 18-19.
  
  Availability of Materials
  
  a. Data
  
  Data availability: Open
  
  Data completeness: Complete = all data necessary to reproduce main results are available
  
  Access Method: Repository
  
  Repository: https://zenodo.org/records/16100607
  
  Data quality: Complete but no metadata associated with the file
  
  b. Code
  
  Code availability: Open
  
  Programming Language(s): Python
  
  Repository link: https://github.com/GaioTransposon/metadata_mining/tree/main
  
  License: CC0
  
  Repository status: Public
  
  Documentation: README file is clear but requires one modification
  
  Computational environment of reproduction analysis
  
  Operating system for reproduction: MacOS 15.6.1
  
  Programming Language(s): Python
  
  Code implementation approach: Using shared code
  
  Version environment for reproduction: Python 3.13.7
  
  Results
  
  5.1 Original study results
  
  Results:
  
  "The improvement in accuracy between GPT's initial classification and the human's performance with the improved prompt was statistically significant (adj p-value=0.031), but also between the human's attempt first attempt (with the initial prompt) and the human's second attempt (better prompt) (adj p-value≤0.001). No significantly different performances were detected for the sub-biome classification, neither between GPT and the human, nor between prompt versions (adjp-value=1)."
  
  The authors identified an error in the manuscript text during the review. Therefore, the following part of the manuscript needs to be updated (see email exchange with the authors below).
  
  5.2 Steps for reproduction
  
  -> Run the first two scripts of container 4 in the GitHub repository: validate_biomes_subbiomes.py and overall_analysis.py
  
  Issue 1: The README instructions for setting up the ~/MicrobeAtlasProject directory can lead to a nested folder structure (~/MicrobeAtlasProject/MicrobeAtlasProject) if followed literally. This causes the Docker container to fail when attempting to access required files like gpt_file_label_map.tsv, since they are not found at the expected path /MicrobeAtlasProject/.
  
  -- Resolved: The issue was resolved by manually renaming and flattening the directory structure after extraction, ensuring that the contents of MicrobeAtlasProject_Zenodo are directly placed inside ~/MicrobeAtlasProject/. However, the current instructions can mislead users, so a clarification in the README would be helpful.
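The manual flattening described for Issue 1 could be scripted along these lines. This is a hedged sketch rather than the reviewer's or the authors' procedure: the function name `flatten_nested` is hypothetical, and it only handles the case named in the issue, where a directory contains a child directory with the same name.

```python
import shutil
from pathlib import Path


def flatten_nested(root):
    """If `root` contains a child directory with the same name, move its contents up.

    Returns True if a nested directory was flattened, False if there was nothing to do.
    """
    root = Path(root)
    nested = root / root.name  # e.g. ~/MicrobeAtlasProject/MicrobeAtlasProject
    if not nested.is_dir():
        return False
    for item in nested.iterdir():
        target = root / item.name
        if target.exists():
            # Never overwrite: stop and let the user inspect the clash.
            raise FileExistsError(f"refusing to overwrite {target}")
        shutil.move(str(item), str(target))
    nested.rmdir()  # only succeeds once the nested directory is empty
    return True
```

It would be called as `flatten_nested(Path.home() / "MicrobeAtlasProject")` before starting the container, so that files such as gpt_file_label_map.tsv sit directly under the project directory.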
  
  Issue 2: During the execution of the overall_analysis.py script, multiple files with the same label were found, requiring manual selection of the file to use for the analysis.
  
  -- Resolved: The manuscript does not specify which file should be selected to reproduce the results, leading to potential ambiguity. By default, I chose the most recent file among the options, assuming it reflects the final data version used in the manuscript. It would be helpful if the documentation or manuscript explicitly stated this to ensure exact reproducibility.
  
  -> Compare the results reproduced to the results presented in the manuscript
  
  Issue 3: The results obtained by running validate_biomes_subbiomes.py are two files: biome_subbiome_results.csv and biome_subbiome_stats.csv, which contain a large amount of output (1,284 and 48,197 rows respectively). The script overall_analysis.py provides overall performance metrics in the terminal output, but does not produce the adjusted p-values relevant to the scope of this review.
  
  -- Unresolved: It was difficult to identify where to find the results presented in the manuscript, so an email was sent to the authors.
  
  Message sent by the authors
  
  Dear reviewer, 
  
  By running validate_biomes_subbiomes.py as described (using gpt_file_label_map.tsv as --map_file), the output will be two .csv files named biome_subbiome_results.csv and biome_subbiome_stats.csv. 
  
  The latter file will contain the stats (hence the adjusted p-values). Were you able to reproduce such files? 
  
  We did notice there are a few mistakes. 
  
  Mistake 1. 
  
  In the manuscript it says: 
  
  " A trained molecular biologist, with no prior exposure to the project, was given the same prompt instructions as GPT and was asked to classify sample biomes and sub-biomes. While against the benchmark dataset, GPT achieved an accuracy of 79.76% (n=499; SD=40.0), the human annotator reached 78.0% (n=250; SD=33.0). "
  
  The second standard deviation should be replaced with SD=42.0. 
  
  Mistake 2. 
  
  In the manuscript it says: 
  
  " The improvement in accuracy between GPT's initial classification and the human's performance with the improved prompt was statistically significant (adj p-value=0.031), but also between the human's attempt first attempt (with the initial prompt) and the human's second attempt (better prompt) (adj p-value≤0.001). "
  
  The first adjusted p-value should not be 0.031 but 0.134 hence not significant so this sentence should be adjusted to: 
  
  " There was an improvement in accuracy between the human's first attempt (with the initial prompt) and the human's second attempt (better prompt) (adj p-value≤0.001). "
  
  Mistake 3. 
  
  We built on biome_subbiome_results.csv and biome_subbiome_stats.csv further than necessary so the two files on Zenodo should be "cut" earlier to avoid confusion. This was a problem of the script validate_biomes_subbiomes.py which concatenates on existing files (e.g.: biome_subbiome_results.csv and biome_subbiome_stats.csv) instead of creating new ones. We should probably proceed by replacing these two files with the files without repetitions. 
  
  We thank you for your work and please do let us know if everything works out now. 
  
  Thank you and kind regards, ####
  
  The authors confirm that the results presented in the manuscript can be found in the biome_subbiome_stats.csv file. This file contains 48,197 rows. According to the authors, the file includes data from both existing files and new data. As a result, it is difficult to determine which data have been reproduced. Even when using a Ctrl+F search for the reported p-value (e.g., p-value = 0.134) in Excel, this value appears in several rows labeled under different configurations (label1/label2), such as: --- chunk_size3000/sync_chunkN_presp0.0; --- chunk_size3000/sync_chunkN_temp1.5; --- chunk_size5000/gpt4-0613; --- machine/sync_chunkY_topp0.0, etc.
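  One way to avoid the ambiguity of a plain text search is to filter the stats file on the pair of labels being compared. The following is a minimal sketch only: the column names `label1`, `label2` and `adj_pvalue`, the label `human_initial`/`human_improved`, and the values in the in-memory sample are assumptions made for illustration and are not taken from the authors' files.

```python
import csv
import io


def rows_for_comparison(csv_file, label1, label2):
    """Return rows whose (label1, label2) pair matches, in either order.

    Column names "label1" and "label2" are hypothetical.
    """
    wanted = {label1, label2}
    reader = csv.DictReader(csv_file)
    return [row for row in reader if {row["label1"], row["label2"]} == wanted]


# Tiny in-memory stand-in for biome_subbiome_stats.csv (values are made up).
sample = io.StringIO(
    "label1,label2,adj_pvalue\n"
    "chunk_size3000,sync_chunkN_temp1.5,0.134\n"
    "chunk_size5000,gpt4-0613,0.134\n"
    "human_initial,human_improved,0.001\n"
)

matches = rows_for_comparison(sample, "human_improved", "human_initial")
print(matches)
```

  Filtering on the label pair returns only the comparison of interest, even when the same p-value appears under several other configurations.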
  
  5.3 Statistical comparison Original vs Reproduced results
  
  Results: The biome_subbiome_stats.csv file was reproduced, but it is difficult to distinguish between the newly reproduced data and the existing data already present in the file. Additionally, the data presented in the manuscript are also hard to identify due to the size of the file. No comparison was performed.
  
  Comments: -
  
  Errors detected: During the review, the authors identified an error in the manuscript text: the first adjusted p-value should be 0.134, not 0.031.
  
  Statistical Consistency: No comparison was performed.
  
  Conclusion
  
  Summary of the computational reproducibility review
  
  The main scripts to reproduce the results were successfully executed and the output files were generated. However, due to the size of the output files and the lack of precise references in the manuscript, it was difficult to identify which parts of the output correspond to the results presented in the paper. Moreover, the authors mentioned that the script adds data to existing output files rather than generating new ones, making it hard to distinguish between old and new data. This led to confusion when trying to compare the reproduced results with those in the manuscript. As a result, a comparison of statistical values was not possible.
  
  Recommendations for authors
  
  To improve the reproducibility of the manuscript, we recommend that the authors:
  
  -- Clarify the instructions in the README about the MicrobeAtlasProject folder.
  -- Ensure scripts generate new outputs, or clarify which data are new vs. existing in the files, for example with a column indicating the origin (e.g., "new" or "existing").
  -- Link results in the manuscript to specific rows/sections in the output files so that the exact data used can be located easily. Another solution could be to include a smaller or filtered version of the output files, with only the rows used for key results, figures or tables, to make checking results easier and avoid errors.
  -- Metadata: For the data used or generated by the scripts, it would be helpful to include accompanying metadata files that explain:
  --- The definition of each variable name.
  --- The origin of each dataset (raw, processed, etc.).
  --- Any preprocessing steps applied before analysis.
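  The recommendation about generating new outputs can be illustrated as follows. This is a minimal sketch, not the authors' script: the function name `write_fresh_results`, the `run_id` column and the example row are hypothetical. It writes each run to its own file and refuses to append to an existing one.

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_fresh_results(rows, out_dir, stem, run_id=None):
    """Write rows to a new run-specific CSV and tag every row with run_id."""
    run_id = run_id or datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_path = Path(out_dir) / f"{stem}_{run_id}.csv"
    fieldnames = list(rows[0].keys()) + ["run_id"]
    # Mode "x" raises FileExistsError instead of appending or overwriting.
    with open(out_path, "x", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow({**row, "run_id": run_id})
    return out_path


# Demonstration in a temporary directory with one made-up row.
with tempfile.TemporaryDirectory() as tmp:
    rows = [{"label1": "a", "label2": "b", "adj_pvalue": 0.5}]
    path = write_fresh_results(rows, tmp, "biome_subbiome_stats", run_id="demo")
    written = path.read_text().splitlines()

print(written)
```

  With one file per run and a `run_id` column, a reviewer can tell which rows were produced by a given execution without comparing against earlier contents.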
3. GigaScience 24 Feb 2026
  
  in GigaScience
  
  Abstract: Over the past decade, central sequence repositories have expanded significantly in size. This vast accumulation of data holds value and enables further studies, provided that the data entries are well annotated. However, the submitter-provided metadata of sequencing records can be of heterogeneous quality, presenting significant challenges for re-use. Here, we test to what extent large language models (LLMs) can be used to cost-effectively automate the re-annotation of sequencing records against a simplified classification scheme of broad ecological environments with relevance to microbiome studies, without retraining. We focused on sequencing samples taken from the environment, for which metadata is important. We employed OpenAI Generative Pretrained Transformer (GPT) models, and assessed scalability, time and cost-effectiveness, as well as performance against a diverse, hand-curated ground-truth benchmark with 1000 examples, that span a wide range of complexity in metadata interpretation. We observed that annotation performance markedly outperforms that of a baseline, manually curated, non-machine-learning keyword-based approach. Changing models (or model parameters) has only minor effects on performance, but prompts need to be carefully designed to match the task. We applied the optimized pipeline to more than 3.8 million sequencing records from the environment, providing coarse-grained yet standardized sampling site annotations covering the globe. Our work demonstrates the effective use of LLMs to simplify and standardize annotation from complex biological metadata.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag015), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1：
  
  The manuscript presents a carefully executed study using non-finetuned GPT models to classify microbiome sample metadata. It is very well written, and both the analyses and the interpretations are generally sound. I found the evaluation thorough and the presentation clear.
  
  The study provides a detailed evaluation of LLM-based metadata curation, clearly advancing over keyword-based approaches. However, it is surprising that recent related studies using LLMs for metadata curation are not cited. For completeness, I suggest including references such as:
  
  https://doi.org/10.1093/gigascience/giaf070 (disclaimer: I am an author of the paper. You might like the table 3),
  
  https://doi.org/10.1093/bib/bbad535,
  
  https://doi.org/10.3897/phytokeys.261.158396,
  
  https://pmc.ncbi.nlm.nih.gov/articles/PMC12099408/.
  
  The scale of the processed data is impressive. However, there appears to be a discrepancy: the Zenodo repository file metadata.out contains 2,254,619 accession IDs (presumably the input), while the GPT output files (gpt_clean_output) include only around 1,000 samples (presumably the benchmark dataset), whereas the manuscript states that 3.8 million samples were processed. It would be helpful to clarify these numbers and, if applicable, explain why fewer outputs are provided. I also recommend reorganizing the Zenodo repository so that readers can download individual files rather than the entire large archive.
  
  The data processing pipeline on GitHub is very useful. The repository currently indicates a CC0 (public domain) license. Since CC0 is typically intended for datasets rather than source code, please clarify whether this was intentional or if a software-specific license (e.g., MIT, Apache 2.0) would be more appropriate.
  
  A different typeface appears in some paragraphs (e.g., pp. 19 and 21). Please check whether this was intentional.
  
  The finding that grouping 5-17 samples per request does not substantially affect accuracy is interesting. Given that GPT models often fail with counting or item listing, the observed quality decline with larger chunk sizes seems reasonable and aligns with expectations.
  
  On p. 26, the observed variability in field usage may be linked to the BioSample package system used for submission (see: https://www.ncbi.nlm.nih.gov/biosample/docs/packages/). Some fields, such as env_biome and env_feature, were once mandatory for environmental samples but are currently optional, I suppose. Such historical changes may partly explain biases in field usage.
  
  The manuscript appropriately highlights the presence of ambiguous or unresolvable sample descriptions. We reached a similar conclusion in our own work with local LLMs: in many cases, even expert curators cannot determine a "correct" label, and the right answer may depend on context or application.
  
  The observation that JSON output significantly improves sub-biome classification accuracy is intriguing and consistent with our internal experience with local LLMs. Since output format may also affect processing speed, it would be useful to report whether response times differed between JSON and inline formats.
  
  One major limitation of the study is the dependence on proprietary GPT models accessible only via OpenAI's API. This constrains reproducibility and long-term availability. Indeed, the recent release of GPT-5 already renders some of the reported results outdated. While the present study remains highly valuable, it would be worthwhile to also evaluate local or open-source LLMs to ensure future reproducibility.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.04.24.650461v1
www.biorxiv.org www.biorxiv.org

An Interpretable Graph-Regularized Optimal Transport Framework for Diagonal Single-Cell Integrative Analysis

2
1. GigaScience 24 Feb 2026
  
  in GigaScience
  
  Abstract: Background: Recent advancements in single-cell omics technologies have enabled detailed characterization of cellular processes. However, coassay sequencing technologies remain limited, resulting in unpaired single-cell omics datasets with differing feature dimensions. Findings: We present GROTIA (Graph-Regularized Optimal Transport Framework for Diagonal Single-Cell Integrative Analysis), a computational method to align multi-omics datasets without requiring any prior correspondence information. GROTIA achieves global alignment through optimal transport while preserving local relationships via graph regularization. Additionally, our approach provides interpretability by deriving domain-specific feature importance from partial derivatives, highlighting key biological markers. Moreover, the transport plan between modalities can be leveraged for post-integration clustering, enabling a data-driven approach to discover novel cell subpopulations. Conclusions: We demonstrate GROTIA’s superior performance on four simulated and four real-world datasets, surpassing state-of-the-art unsupervised alignment methods and confirming the biological significance of the top features identified in each domain. The software is available at https://github.com/PennShenLab/GROTIA.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag012), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2：
  
  This paper introduces a graph-regularized optimal transport framework, GROTIA, for aligning multi-omics datasets. It is a diagonal integration method capable of aligning single cells without requiring direct cell-cell correspondences. The interpretable embeddings produced by GROTIA are particularly impressive and broaden the applicability of diagonal integration approaches. Overall, the paper is clearly written and well-structured. I only have a few minor comments:
  
  1. Kernel-based methods are typically limited in scalability since they require optimization over the entire kernel matrix. How do the authors address this issue? Can the authors also provide more details on the computational efficiency of the model?
  2. The optimization procedure for Equation (9) is not sufficiently clear. A more detailed algorithmic description can be very helpful.
  3. Can the interpretable embeddings introduced here be generalized to other kernel-based methods, such as MMD-MA?
  4. A more comprehensive robustness analysis with respect to parameter choices can be helpful.
2. GigaScience 24 Feb 2026
  
  in GigaScience
  
  Reviewer 1：
  
  The manuscript presents a well-motivated and technically elegant approach to diagonal single-cell data integration, combining optimal transport with graph-based regularization to achieve a balance between global and local structure alignment. The method addresses an important challenge in single-cell data integration, where existing approaches still leave room for improvement. Its embedding design offers the potential for interpretable feature-level insights, a particularly desirable quality in single-cell multi-omics integration where biological interpretability is especially important.
  
  That said, the manuscript would be substantially strengthened by deeper validation and a clearer demonstration of reproducibility. Some claims would benefit from stronger empirical support in the presented results, and a more thorough evaluation of the method's added value relative to unimodal alternatives, particularly in the context of marker gene discovery and the identification of cell types or subpopulations, could further enhance the manuscript. Additionally, the impact of key parameter choices, such as kernel bandwidth selection, the number of nearest neighbors (k), and sensitivity to hyperparameters (λ, ρ), should be more fully explored, reported, or justified. Reproducibility could be improved by providing scripts and a computational environment or container to replicate all analyses and figures presented in the manuscript. Usability would also be improved by providing the method as an installable Python package, rather than limiting implementation to a Jupyter Notebook.
  
  Overall, the manuscript introduces a compelling methodological framework with meaningful potential for applications in single-cell integration. The suggestions that follow are intended to help the authors strengthen their contribution in alignment with GigaScience's emphasis on openness, reproducibility, and FAIR principles. I hope these suggestions will help strengthen the support for the authors' conclusions, clarify the reasoning behind key arguments, and improve the clarity and interpretability of the figures and descriptions.
  
  Reproducibility
  
  Reproducibility is impeded by the absence of clearly organized scripts or workflow files to regenerate the results, figures, and tables presented in the manuscript. While some outputs are shown or alluded to in the Jupyter Notebook found in the linked GitHub repository, they are not clearly cross-referenced with the paper's results, making it difficult to confirm how specific figures or tables were produced. Furthermore, no computational environment specification is provided, which makes replication with confidence impossible. Certain aspects of the manuscript fall short of best practices for transparent and reproducible research. Analysis scripts are incomplete or undocumented, and key portions of the software pipeline are either insufficiently described or lack proper attribution. These limitations hinder reproducibility and reduce reusability. Figures would also benefit from clearer annotation. Collectively, these shortcomings detract from alignment with the FAIR principles emphasized by GigaScience. Reproducibility would be significantly improved by packaging the software, versioning the code, defining and documenting the computational environment, and depositing all components of the analysis pipeline, including preprocessing scripts, evaluation code, and figure generation, in a publicly accessible repository.
  
  Additionally, while it is generally clear how the data were collected and curated, the rationale for using preprocessed datasets, particularly those sourced from external repositories, could be more clearly explained. The data are shared via a Google Drive link provided in the GitHub repository, which is convenient, though it may benefit from a more transparent and persistent form of distribution. The manuscript states that "All data used in this manuscript is publicly available and can be found at Liu et al. [11], Cheow et al. [16], Demetci et al. [12], Chen et al. [17], Cao et al. [14], and Samaran et al. [13].", but it appears that preprocessed versions of these datasets were used, rather than the original raw data. Clarifying this point would help improve transparency and reproducibility.
  
  The manuscript also describes custom preprocessing procedures for scRNA-seq and scATAC-seq data, including PCA, TF-IDF normalization, and gene filtering, that appear inconsistent with the properties of the datasets used. Without access to preprocessing scripts or further clarification, it is unclear whether these procedures were performed as described. Clarifying these discrepancies would strengthen transparency and ensure fair benchmarking comparisons. In addition, to improve transparency and reproducibility, it would be helpful to provide the scripts or commands used to run these baseline methods, along with the evaluation code for computing the reported metrics.
  
  Finally, several methodological details underlying downstream analyses are insufficiently described to allow confident reproduction or interpretation. For instance, it is unclear which dataset was used to obtain the results in "GROTIA Reveals Gene-Specific Contributions and Key Biological Processes in the RNA Embedding" and in Figures 4 and 5. Additionally, the description of the motif discovery step using GimmeMotifs should be expanded, since it is currently not entirely clear how motifs were matched to known transcription factors, and the process described in the text does not fully align with what is shown in Figure 5A. Clarifying these points would help improve the reproducibility and interpretability of the manuscript's key biological findings.
  
  Usability
  
  The code repository is easy to find on GitHub, available under the MIT license, following the link presented in the manuscript. However, the currently presented implementation is provided as a Jupyter Notebook that demonstrates the basic usage of the method, and technically allows users to replicate the process using their own data. Usability is currently limited by sparse documentation and could benefit from guidance on input requirements, parameter configuration, and expected output formats. To improve usability, the authors should supplement the notebook with detailed explanations, comments, and a README or user guide that explains how to prepare input data, adjust key parameters, interpret outputs, and run the method on other datasets. Wrapping core functionality into a small, importable Python module or script would further reduce friction for adoption and integration into pipelines.
  
  Attribution and Software Transparency
  
  The GitHub repository includes an evals.py script originally authored by the creators of SCOT (Pinar Demetci, Rebecca Santorella, and Ritambhara Singh), with attribution preserved within the file. However, the manuscript itself does not mention that components of the evaluation pipeline were adapted from this prior work. Given that this script supports benchmarking comparisons central to the paper's conclusions, explicit acknowledgment in the text would improve transparency and ensure appropriate credit is given.
  
  Support for Claims and Biological Interpretation
  
  Several key claims would benefit from additional evidence or clarification. I divide this into subsections "4a. Methodological Claims," "4b. Biological Interpretation," and "4c. Clustering Evaluation" for extra clarity and readability.
  
  4a. Methodological Claims
  
  - The claim "we selected the latent dimension to be either 5 or 8 and observed that GROTIA remained robust to this choice" is not substantiated by any reported results or sensitivity analysis.
  - The claim that GROTIA is computationally efficient would be more compelling if runtime comparisons included system specifications, analysis on larger (potentially synthetic) datasets, memory usage, and scalability assessments across CPU and GPU modes. Directly referencing Table A1 for the current runtime evaluation and adding the additional metrics mentioned above would provide a more comprehensive evaluation.
  - The manuscript asserts "Notably, unlike methods that require shared features across modalities, GROTIA only assumes that cells (rather than individual genes or peaks) follow a similar distribution if they belong to the same type or lineage—thus broadening its applicability to complex datasets." This claim would be more convincing if supported by analyses on more complex datasets, such as those with technical variability across origin sites, donors, or protocols; mosaic structures with missing observations; nested batch effects; or significant differences in data quality. Additionally, this statement may appear in tension with the claim that GROTIA depends on the presence of a shared underlying biology, which would not hold in many complex or heterogeneous settings. Clarifying how "complexity" is defined in the context of GROTIA's assumptions, and empirically substantiating the method's generalizability to such settings would improve both the precision and credibility of this claim.
  - While the manuscript assesses alignment quality using Fraction of Samples Closer Than the True Match (FOSCTTM) and Label Transfer Accuracy (LTA), capturing local alignment and biological label concordance, these metrics do not directly evaluate preservation of global structure. Since GROTIA is designed to balance both global and local alignment, it would be helpful to include an explicit global alignment metric to confirm that this objective is being met. Some of the provided figures (e.g., Fig. 2c, right panel, and Fig. 3b after alignment) suggest global structure is preserved, but incorporating a dedicated metric or discussion would strengthen the evidence and provide a more complete evaluation of alignment quality.
  - Likewise, the manuscript states that GROTIA employs orthogonality constraints within the Reproducing Kernel Hilbert Space (RKHS) to enhance interpretability and stability. The use of these constraints for interpretability is illustrated through feature importance analyses; however, there is no direct comparison showing that this approach yields improved interpretability relative to unimodal analyses. Additionally, the effect of orthogonality constraints on embedding stability is not clearly assessed. Providing empirical evidence that these constraints improve the consistency of the embeddings or the quality of feature discovery, particularly in relation to single-modality methods, would help confirm the added value of this design choice and support several of the broader claims made regarding marker gene discovery and cell population characterization.
  - The decision to exclude scConfluence from the scGEM and SNARE evaluations due to prior dimensionality reduction could be better substantiated. Since raw data for both datasets are publicly available (e.g., SNARE-seq on GEO, scGEM on SRA), it would be helpful to explain why reprocessing the data was not feasible or appropriate.
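  For readers less familiar with FOSCTTM, the metric discussed above can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the evaluation code used in the manuscript: it assumes row i of each embedding is the same cell, uses Euclidean distance, and normalizes by n - 1 (conventions differ between implementations). The toy coordinates are made up.

```python
import math


def foscttm(x_embed, y_embed):
    """Mean fraction of samples closer than the true match (lower is better).

    Assumes row i of x_embed and row i of y_embed describe the same cell.
    """
    n = len(x_embed)
    if n != len(y_embed) or n < 2:
        raise ValueError("need two equally sized embeddings with at least 2 cells")

    def one_direction(source, target):
        fractions = []
        for i, point in enumerate(source):
            true_dist = math.dist(point, target[i])
            # Count other-domain samples strictly closer than the true match.
            closer = sum(
                1
                for j, other in enumerate(target)
                if j != i and math.dist(point, other) < true_dist
            )
            fractions.append(closer / (n - 1))
        return fractions

    # Average over both directions (X to Y and Y to X).
    both = one_direction(x_embed, y_embed) + one_direction(y_embed, x_embed)
    return sum(both) / len(both)


x = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
perfect = foscttm(x, [(0.1, 0.0), (1.0, 1.1), (5.0, 4.9)])
swapped = foscttm(x, [(1.0, 1.1), (0.1, 0.0), (5.0, 4.9)])
print(perfect, swapped)
```

  Because each cell is scored only against its own true match, the value reflects neighborhood-level agreement, which is why a separate global-structure metric is requested above.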
  
  4b. Biological Interpretation
  
  - The reasoning in the statement "Notably, GROTIA requires no a priori matching of features across modalities, so these dimension-specific drivers offer an unbiased method to uncover potential marker genes" is somewhat unclear. While the method's ability to operate without explicit feature matching is a strength, it would be helpful to clarify how this property directly leads to unbiased marker discovery. In particular, elaborating on how the dimension-specific drivers compare to features identified through unimodal or matched-feature approaches would strengthen the interpretation.
  - Several statements related to cell-type-specific gene expression, such as "LYZ, ZEB2, PLXDC2 are highly expressed in monocytes…", would benefit from appropriate citations. This applies to other claims throughout the manuscript regarding gene specificity for particular lineages or subtypes.
  
  4c. Clustering Evaluation
  
  - The claim that GROTIA achieves "comparable or better performance" than Louvain clustering is not fully supported. While ARI/NMI scores of 0.75-0.8 indicate reasonable alignment with reference annotations, clarity on how ground truth (reference) labels were defined, whether Louvain resolution parameters were tuned, and which dataset(s) were used would strengthen this comparison. Additionally, specifying which co-clustering algorithm was used from the cited Python package, along with its parameter settings, would improve reproducibility and interpretability.
  - The claims that GROTIA can uncover finer structures and novel cellular states, as well as identify refined subpopulations aligned with major cell types, are intriguing but would benefit from additional support. As currently presented, the results do not highlight specific novel cell populations or provide examples of newly discovered subclusters.
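  To make explicit what an ARI score depends on, the adjusted Rand index can be computed from the two label vectors alone, as in the sketch below. This is a generic pure-Python illustration of the standard formula, not the evaluation code used in the manuscript; the toy labels are made up.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two labelings of the same length (at least 2)")

    # Pairs of samples that share a cluster in both, in truth, and in prediction.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, crossed)
```

  The score is determined entirely by the reference labels and the predicted partition, so the definition of the ground-truth labels and the clustering resolution both directly affect the reported values.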
  
  Writing, organization, tables, and figures, and minor notes
  
  There is a typo in the heading "GROTIA integrated simulated datasets in both semi and unsuperviseed setting" where unsuperviseed should be unsupervised.
  
  Under this heading, the section describing Figure 2a in paragraph two and paragraph three largely overlap.
  
  The results and interpretation of Figure 2 panels b and c are not described to the reader. The same is true for Figure 3 panels b and c.
  
  In Figure 3, the method is still labeled as GROT instead of GROTIA; this should be updated for consistency.
  
  In Figure 3, the abbreviations Semi Acc and Un Acc are not defined in the legend and should be clearly explained.
  
  In Figure 3, the visual layout in panel b differs between datasets and may be confusing: for scGEM and SNARE-seq, the left and right columns represent cell types from each modality, whereas for PBMC, they reflect cell type and modality origin from a single, combined dataset. The PBMC-style presentation is more effective for visually assessing global alignment and should either be used consistently or more clearly explained.
  
  In Figure 3, legends are also missing descriptions of the color schemes used to denote modality.
  
  In panel c of both Figures 2 and 3, it should be specified whether the results correspond to semi-supervised or unsupervised alignment.
  
  In statements such as "Figure 4b presents UMAP visualizations of the top gene expression patterns for Dimensions 1 and 3", the wording could be clarified to avoid confusion. Specifically, it would help to state that gene expression patterns are overlaid on a UMAP projection of the scRNA-seq data, and that the genes visualized were selected based on their importance in Dimensions 1 and 3 of the RBF kernel embeddings (not UMAP axes).
  
  In Figure 4 panel a, it appears that several genes from D1-4 have higher importance in D5. Is this due to scaling, or does it have some biological interpretation?
  
  In Figure 4, panel c, the colorbar should be labeled.
  
  In Figure 5, panel a, only chromosome identifiers are shown, making the peak information incomplete and difficult to interpret. Including specific peak coordinates would improve clarity.
  
  In Figure 5d, it is not clear how accessibility is quantified for a specific gene; this should be described in the Methods section and reiterated in the results description.
  
  While the context makes it clear, explicitly noting that SPI1 is also known as PU.1 could improve clarity for readers less familiar with the nomenclature.
  
  The explanation of the proposed regulatory relationship between CEBPB and KLF4 could be strengthened. The manuscript notes that both factors cooperate with PU.1, but no direct link between CEBPB and KLF4 is established, aside from their shared involvement in monocyte development and differentiation.
  
  The statement that "co-expression networks further link CLEC7A with an IRF8-centered module" would be more convincing with a supporting citation or additional methodological detail on how this link was established.
  
  The description of Figure 5c could be expanded. The current phrasing, "validated through literature. For instance, FOS are implicated as potential regulators of KLF4 in Dimension 1 and CEBPB of FCR1G in Dimension 2" would be better placed in the main text, supported by citations, and more clearly connected to the results.
  
  It would strengthen the interpretation if claims about the cell-type specificity of TF-target pairs were explicitly linked to the expression patterns shown in Figure 5, panel d.
  
  Figure 5 panel d is missing a label on the color bar.
  
  "gene" in "as potential regulators of Gene KFL4" in the legend of Figure 5 should not be capitalized.
  
  The section identifier is missing from the statement "For further details, please refer to Section ."
  
  Figure 6, panel b is missing a label for RNA on the y-axis.
  
  The referencing in a few instances could be strengthened for clarity and accuracy. For example, the statement "Lots of computational methods have recently been developed to integrate data across multiple modalities [4, 5]" cites only two methods, which may not sufficiently support the breadth implied. Either citing additional representative methods or rephrasing the sentence to more accurately reflect the scope would improve the credibility of the claim.
  
  To further support the claim that "GROTIA delivers comparable or superior performance," the authors might consider including comparisons to other recent diagonal integration methods such as Pamona and the updated version of SCOT: SCOTv2.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.10.30.621072v2
www.biorxiv.org www.biorxiv.org

Improved genome assembly of whale shark, the world’s biggest fish: revealing “chromocline” in intragenomic heterogeneity

2
1. GigaScience 24 Feb 2026
  
  in GigaScience
  
  Abstract: High-quality chromosome-level assemblies are essential for understanding genome evolution but remain difficult to obtain for large and complex genomes. Here we present a near gap-free genome assembly of the whale shark (Rhincodon typus) generated with long-read sequencing and Hi-C scaffolding, markedly improving contiguity and completeness. In particular, the X chromosome was extended to nearly twice its previous length, and putative pseudoautosomal regions were identified. Moreover, we report the first Y-linked scaffolds for this species. Comparative analyses with the zebra shark revealed exceptionally low substitution rates across the genome. We further detected a negative correlation between chromosome length and synonymous substitution rate (dS), explained by a positional gradient, designated as “chromocline”, in which substitution rates gradually decrease from chromosomal ends toward central regions. Notably, the X chromosome exhibited low dS compared with autosomes of similar size, consistent with male-driven evolution. Our results highlight positional and sex-chromosome effects as key determinants of molecular evolutionary rates. The improved assembly will enable broad application to population-genetic and conservation genomic analyses in the whale shark.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag014), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2:
  
  Overall, I find the manuscript well written and clear, and I think that the improved assembly it presents is a sufficiently large step forward to warrant publication. I have no issues with either the construction of the assembly (the procedure is state-of-the-art for non-model organisms) or the bioinformatic analysis of substitution rate variation across the chromosomes, which I find well done. Likewise, the detection of Y-chromosome fragments is convincing, in my opinion.
  
  In summary, I see no reason why this MS should not be accepted.
  
  Minor comment: If possible, improve the quality of figure panels 1B (in particular the scale bar) and 2F.
2. GigaScience 24 Feb 2026
  
  in GigaScience
  
  Reviewer 1:
  
  This is a solid piece of work. The improved Rhincodon typus assembly is genuinely better than previous versions—substantially higher contiguity, more complete BUSCO recovery, and a more convincing reconstruction of the X chromosome. The identification of candidate Y-linked scaffolds is plausible and methodologically defensible. The downstream evolutionary analyses are generally well executed, and the "chromocline" concept is interesting and—if interpreted with caution—potentially valuable for the field.
  
  I really could not find any important flaws in the methodology. It would be interesting to discuss the small Y scaffolds in relation to recently identified "Y" contigs. Apart from bamboo sharks, a recent assembly of the Carcharhinus amblyrhynchos genome also identified Y scaffolds and pseudo-autosomal regions on X.
  
  Furthermore, as the authors themselves note, several other papers (cited in text) have demonstrated similar "chromoclines". I would perhaps argue that the novelty here is slightly overstated, and defining this as "chromocline" risks sounding like terminology inflation rather than conceptual novelty.
  
  These are clearly fairly subjective comments, which I provide only because the work is otherwise very solid.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.09.06.674125v1
www.biorxiv.org www.biorxiv.org

Single-nucleus multiple-organ chromatin accessibility mapping in the rat

3
1. GigaScience 24 Feb 2026
  
  in GigaScience
  
  Summary: The chromatin accessibility landscape is the basis of cell-specific gene expression. We generated a multiorgan, single-nucleus chromatin accessibility landscape from the model organism Rattus norvegicus. For this single-cell atlas, we constructed 25 libraries via snATAC-seq from nine organs in the rat, with a total of over 110,000 cells. Cell classification integrating gene activity scores with known marker genes identified 77 cell types, which were strongly correlated with those in published mouse single-cell transcriptome atlases. We further investigated the enrichment of cell type- and organ-specific transcription factors (TFs), the dynamics of T-cell developmental trajectories across organs, and the conservation and specificity of gene expression patterns across species. These findings provide a foundation for further investigations of the cell composition and gene regulatory networks throughout the rat body.

  Highlights:
  - Generation of a single-cell atlas of chromatin accessibility in nine organs of the rat
  - Characterization of cell type- and organ-specific transcription factors (TFs)
  - Dynamics of chromatin accessibility in developing T cells revealed by cross-organ analysis
  - Conservation and specificity of gene expression patterns among humans, mice, and rats revealed by cross-species analysis

  Competing Interest Statement: The authors have declared no competing interest.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag013), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3:
  
  In this study, Ronghai Li and colleagues constructed an extensive multi-organ single-nucleus chromatin accessibility atlas of the model organism Rattus norvegicus. The authors generated a comprehensive dataset encompassing 115,723 single-nucleus chromatin accessibility profiles across nine organs (thyroid, thymus, heart, lung, liver, spleen, kidney, pancreas, and ovary). Each organ was profiled in duplicate or triplicate, thereby ensuring reproducibility and robustness of the dataset. The authors also performed rigorous preprocessing and filtering steps, which provided a high-quality foundation for downstream analyses.
  
  The downstream analyses were multifaceted and thoughtfully executed in the following steps: 1) Low-dimensional visualization of all cells within the atlas, represented by organs, which led to the identification of six major cell types; 2) A census of the distribution of major cell types across the analyzed organs; 3) Integration with existing mouse single-cell RNA-seq atlases to refine cell type annotations (77 cell subtypes) and ensure cross-species comparability; 4) Inference of transcription factor activities from open chromatin profiles, providing important insights into gene regulatory mechanisms.
  
  Building upon these analyses, the authors focused on shared and organ-specific features of endothelial and stromal cells, thereby highlighting both conserved and divergent regulatory programs. Finally, through integration of human, rat, and mouse scRNA- and scATAC-seq atlases for the heart and kidney, they investigated cross-species similarities and differences in gene expression and regulatory patterns, further strengthening the relevance and translational potential of this resource.
  
  In my opinion, the manuscript by Ronghai Li et al. is well written, the data are of very high quality, and the study represents a significant data resource. The rat (Rattus norvegicus) is a widely used and indispensable model organism in biomedical research, particularly for studies related to disease onset, progression, and therapeutic development. The generation of this single-nucleus chromatin accessibility atlas, together with the comprehensive analyses provided, constitutes a valuable resource for dissecting organ- and tissue-specific regulatory landscapes. This work not only enhances our understanding of gene regulation across organs but also facilitates cross-species comparisons that will be of great importance for translational research. I therefore strongly support acceptance of the manuscript by Ronghai Li et al. for publication in GigaScience, contingent only upon minor revisions as outlined below:
  
  Minor revisions:
  
  1) While the data preprocessing and analysis steps are clearly described in the Methods section, the code used for data analysis is currently available only upon request. I believe that future readers and, in particular, potential users of this valuable resource would greatly benefit if the analysis code were made publicly accessible. Open availability of the code would not only enhance transparency and reproducibility but also facilitate broader adoption of the dataset. This is especially important as new single-cell ATAC-seq and RNA-seq datasets become available, since ready access to the analysis pipeline will accelerate and streamline future studies that build upon the provided atlas.
  
  2) Line 131: "Stromal cells, immune cells, and endothelial cells from different organs tend to be clustered together (i.e., by cell type) rather than clustered according to the organ of origin or sample batch (Figure 1E)." and Line 137: "However, these cells tended to cluster by organ rather than by cell type (Figure 1E)."
  
  This observation can be clearly seen in the UMAPs (Figure 1C-D), but not in the census plot (Figure 1E). I therefore recommend revising the figure reference to (Figure 1C-D) to improve accuracy and clarity for the reader.
  
  3) Throughout the manuscript, the authors often assess whether cells cluster in a cell type-specific or organ-specific manner. For immune and epithelial cells, however, the conclusions can be confounded, as they vary depending on the analytical method used (e.g., scATAC-seq alone versus integration with scRNA-seq data). Could the authors elaborate on the robustness of these observations with respect to the choice of UMAP parameters, particularly the number of features included and the dimensionality of the LSI applied?
  
  4) The authors used the CIS-BP database of transcription factor motifs to assess cell type-specific transcription factor activities. However, JASPAR, which is a curated database, is more commonly used in scATAC-seq studies. Could the authors clarify the rationale for choosing CIS-BP over JASPAR?
2. GigaScience 24 Feb 2026
  
  in GigaScience
  
  Reviewer 2:
  
  In this manuscript, Li et al. present a single-nucleus rat atlas of chromatin accessibility, with over 110,000 cells measured. They annotate cell types, compare types across organs, and identify organ-specific or shared features, inferring transcription factors and gene regulatory programs. They also integrate human and mouse tissue to validate many of these findings. I found the manuscript to be of value for many biologists because of its exhaustive effort to characterize cells in the rat, although the actual analyses and findings are not entirely novel. Novelty is not the overall goal of the manuscript, however, so my enthusiasm remains high. Specific comments are as follows:
  
  The introduction is quite short and relatively shallow, with very little justification for the use of a rat atlas. The authors should elaborate on this justification and provide a more comprehensive overview of the use of these atlases in greater detail, rather than just providing a list of undefined analyses which may not be familiar to the reader.
  
  The authors extracted tissues from a single rat. Although it is beneficial to have a deep characterization of an individual, such a small sample size would not be the best representation for an entire atlas (this is mentioned in the limitations).
  
  The authors could consider integrating other rat snATAC-seq data sets, of which several exist.
  
  The TSS enrichment scores and number of fragments are negatively correlated. Why is that?
  
  How reliable are the identified doublets? If the authors run the algorithm to detect doublets on individual samples or replicates versus all samples, are the same cells identified?
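
  One simple way to quantify the agreement asked about here is the Jaccard index between the doublet calls from the two runs. The sketch below is illustrative only: the barcodes and the function name `jaccard` are hypothetical, and it is not tied to any particular doublet-detection tool or to the authors' pipeline.

```python
def jaccard(calls_a, calls_b):
    """Jaccard index between two collections of cell barcodes flagged as doublets."""
    a, b = set(calls_a), set(calls_b)
    if not a and not b:
        return 1.0  # neither run flagged anything: treat as full agreement
    return len(a & b) / len(a | b)


# Hypothetical barcodes flagged when running per sample versus on pooled data
per_sample = {"AAAC-1", "AAAG-1", "CCGT-2"}
pooled = {"AAAC-1", "CCGT-2", "TTGA-3"}
agreement = jaccard(per_sample, pooled)
```

  A value near 1 would indicate that the same cells are flagged regardless of how the samples are grouped; a low value would suggest that the doublet calls depend on the grouping.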
  
  Figure 1 has icons and images which are quite small.
  
  The authors mention LSI dimensionality reduction for clustering, but then display everything with UMAP which is known to distort distances.
  
  UMAP is also notoriously bad at handling batch effects, which seem to be present in this atlas. Has any effort been used to mitigate this issue? It would be better to use an alternative visualization that maintains distances.
  
  How reliable are the cell types from the automated annotator given the relationships in the hierarchical clustering? For example, "Immature_T_cell"s are more closely clustered with "Goblet" cells than "Thymic_T_Cell"s, and these types of relationships are throughout the tree. Additional validation should be completed, including at minimum a gene expression scoring of markers for each of these cell types to see if they match the annotation.
  
  To determine the organ specificity of different cell types, the authors can additionally look at, for instance, alveolar / interstitial macrophages in the lung or another tissue or tissue resident fibroblasts, comparing those specific cell types rather than alveolar vs. any other epithelial cell type.
  
  The finding that endothelial clusters were mostly grouped by organs should be verified in the context of batch correction. The authors could also consider clustering before dimensionality reduction and using consensus clustering with subsampling.
  
  l. 369: CMs not defined.
  
  l. 396: ECs not defined.
  
  "sc" and "sn" used interchangeably (e.g. Figure 5A), which should be "snATAC-seq".
  
  l. 445: The central dogma of molecular biology defined here is stated incorrectly and should be fixed or removed, although it appears irrelevant for the discussion.
3. GigaScience 24 Feb 2026
  
  in GigaScience
  
  Reviewer 1:
  
  Li et al. present a manuscript in which they generated a snATAC-seq atlas of 9 major organs in adult rat and integrated the atlas with mouse and human scRNA-seq and scATAC-seq data, revealing that chromatin accessibility is largely conserved between cell types across species and that there is also tissue-specific regulation in some cell types even when they are common across several tissues. Overall, this looks like a great, carefully analysed and annotated resource that would be useful for the community. I appreciate the amount of work that went into curating and analysing this dataset, and I thought that the manuscript was very well written and clear.
  
  I think the most interesting finding is in figure 3 where the authors found unique TFs regulating the same cell-types but in different organs. However, the analysis ends abruptly other than listing these TFs. Can the authors comment on what are the functional consequences/associations of these tissue-specific TFs, perhaps in the discussion?
  
  The raw data is deposited in a database and can be openly downloaded, but I find that the lack of processed data (e.g. processed and labelled expression matrices or objects) may prevent adoption of this data by the community, as it is a lengthy process to reach the authors' conclusions. The authors might also want to consider incorporating an interactive platform for users to explore and navigate this dataset.
  
  While I appreciate that the authors have detailed in their manuscript how they performed the data analysis, I would still encourage them to upload their scripts/notebooks to an open code repository; otherwise, adoption by the community would again be hindered.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.11.11.622900v1
Jan 2026
www.biorxiv.org www.biorxiv.org

nf-core/proteinfamilies: A scalable pipeline for the generation of protein families

2
1. GigaScience 30 Jan 2026
  
  in GigaScience
  
  Abstract: The growth of metagenomics-derived amino acid sequence data has transformed our understanding of protein function, microbial diversity and evolutionary relationships. However, the vast majority of these proteins remain functionally uncharacterized. Grouping the millions of such uncharacterised sequences with the few experimentally characterised ones allows the transfer of annotations, while the inspection of conserved residues with multiple sequence alignments can provide clues to function, even in the absence of existing functional information. To address the challenges associated with this data surge and the need to group sequences, we present a scalable, open-source, parametrizable Nextflow pipeline (nf-core/proteinfamilies) that generates protein nascent families or assigns new proteins to existing families. The computational benchmarks demonstrated that resource usage can scale approximately linearly with input size, while the biological benchmarks showed that the generated protein families closely resemble manually curated families found in widely used databases.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag009), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Castrense Savojardo
  
  This manuscript presents a Nextflow pipeline (nf-core/proteinfamilies) for large-scale protein-family generation. Overall, I think the paper is well written and clear. The pipeline appears very useful, and the reported results show good performance in both family reproducibility and computational efficiency.
  
  I have a few minor comments requesting additional details:
  
  1) Does the quality-check step only compute statistics, or is it also used to filter/clean the input set? If so, please specify the criteria and whether filtered sequences are excluded downstream.
  
  2) Which MMseqs2 clustering mode is used (set cover, connected components, or greedy)? Can this be changed within the pipeline? If configurable, please indicate the relevant parameters.
  
  3) In the reproducibility benchmark, you use DIAMOND BLASTp to assess similarity between the initial sequence set for the selected families and additional Swiss-Prot sequences. Which sequence identity and alignment coverage (if any) thresholds were applied?
  
  4) Counts and coverage (p. 6): You state that "These 709 families captured 96.66% of the original unique sequence identifiers (103,385 out of 106,959)." However, a few lines above, the final input set is reported as 169,605 unique protein sequences. Could you please clarify the initial number of sequences and the actual coverage after family generation and redundancy reduction?
  
  5) Figures S1 and S2 are difficult to read due to low resolution.
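
  The figures queried in comment 4 can be checked with simple arithmetic. The short sketch below uses only the numbers quoted in the review; which denominator the authors intended is exactly the open question, so neither result is asserted here as the correct coverage. The variable names are illustrative.

```python
captured = 103_385        # identifiers captured by the 709 families (as quoted)
quoted_total = 106_959    # denominator used in the quoted sentence
reported_input = 169_605  # unique protein sequences reported as the final input set

coverage_quoted = 100 * captured / quoted_total
coverage_vs_input = 100 * captured / reported_input

print(f"{coverage_quoted:.2f}%")    # 96.66%
print(f"{coverage_vs_input:.2f}%")  # 60.96%
```

  The quoted 96.66% is therefore consistent with the 106,959 denominator, whereas the same numerator measured against 169,605 sequences would give a much lower figure, which is why the reviewer asks for clarification.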
2. GigaScience 30 Jan 2026
  
  in GigaScience
  
  Reviewer 1: Vikram Alva
  
  The authors present nf-core/proteinfamilies, a standardized Nextflow workflow that constructs protein families de novo or classifies sequences against existing families. Using a curated 200-family benchmark and a UniRef90-scale run, the authors show that the pipeline attains high recall with efficient runtimes. Given the ever-increasing size of sequence databases, this work is timely and fills a practical gap in reproducible, at-scale family curation; I expect it to be adopted widely by many research groups.
  
  I have several comments and suggestions below:
  
  In my view, this workflow will, by construction, yield a mixture of families: some anchored on a single conserved domain/segment, others centered on recurrent multi-domain cores, and some that capture the full-length sequence. This differs from widely used family databases: Pfam is largely domain-level, whereas HAMAP and NCBIFAM are mostly full-length/isofunctional (with PANTHER sitting in between). The resulting granularity is largely determined by MMseqs2 settings (sequence identity, query/target coverage, coverage mode) and by any alignment trimming, which biases toward conserved cores. Please add a brief discussion making this explicit, with practical guidance for tuning toward full-length versus domain-centric generation of families.
  
  I also recommend a parameter-sensitivity analysis on the 200-family set: sequence identity (30-70%), coverage thresholds (50-95%), and coverage mode (query/target/both), with and without trimming. For each setting, report (i) total families and split/merge rates per curated family, and (ii) a simple granularity readout, the proportion classified as domain-anchored, multi-domain, or full-length. This would clarify how parameter choices drive family counts and domain/full-length centricity, and help readers select defaults aligned with their use case.
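
  The size of such a sensitivity analysis can be gauged by enumerating the grid. The sketch below is a hypothetical illustration: the specific grid points are chosen arbitrarily within the ranges the reviewer suggests, and the dictionary keys are placeholder names, not actual pipeline or MMseqs2 parameter names.

```python
from itertools import product

# Hypothetical grid points within the ranges suggested in the review
identities = [0.3, 0.5, 0.7]                  # sequence identity
coverages = [0.5, 0.8, 0.95]                  # coverage threshold
coverage_modes = ["query", "target", "both"]  # which sequence(s) must be covered
trimming = [True, False]                      # with and without alignment trimming

# One dictionary per setting to be run on the 200-family benchmark
grid = [
    {"identity": i, "coverage": c, "coverage_mode": m, "trim": t}
    for i, c, m, t in product(identities, coverages, coverage_modes, trimming)
]

print(len(grid))  # 54
```

  Even this coarse grid yields 54 runs, so the per-setting readouts the reviewer asks for (family counts, split/merge rates, granularity proportions) would best be collected in a single summary table.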
  
  In the results, the splits/misses are concentrated in Pfam/PANTHER, while HAMAP/NCBIFAM are much closer to one-to-one (HAMAP 50/50). This suggests the inflated family count is driven, in part, by the domain-centric portion of the benchmark rather than the method itself. Please add a brief note in the Discussion to make this explicit.
  
  Since AFDB has models for most UniProt entries, could these models be used as an orthogonal purity check of the generated families; e.g., map members to AFDB and ask whether they cluster to the same fold by TM-score/Foldseek (allowing full-length differences when the family is domain-anchored)?
  
  HHsearch-based merging of divergent splits. In my view, and the authors note this, several curated families split simply because sequences are very divergent. An optional HHsearch (HMM-HMM) pass could merge these back: merge only at high probability (≈≥95%) with reciprocal coverage of the shorter model (≥0.6). It would be useful to include this as an optional stage in the pipeline.
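
  The merge rule suggested here can be written as a simple predicate. This is a minimal sketch under stated assumptions: the function name `should_merge` and its arguments are hypothetical, "coverage of the shorter model" is read as aligned columns divided by the length of the shorter of the two models, and no HHsearch output parsing is shown.

```python
def should_merge(probability, aligned_len, len_a, len_b,
                 min_prob=95.0, min_cov=0.6):
    """Return True if two family models pass the suggested merge thresholds.

    probability: HMM-HMM match probability in percent.
    aligned_len: number of aligned model columns.
    len_a, len_b: lengths (in columns) of the two family models.
    """
    shorter = min(len_a, len_b)
    if shorter <= 0:
        raise ValueError("model lengths must be positive")
    coverage = aligned_len / shorter  # fraction of the shorter model that is aligned
    return probability >= min_prob and coverage >= min_cov


# Hypothetical hit: 98% probability, 70 aligned columns, models of length 100 and 250
decision = should_merge(98.0, 70, 100, 250)
```

  Requiring both thresholds keeps the merge conservative: a high-probability hit that covers only a short motif of the shorter model would not trigger a merge.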
  
  Optional annotation of de novo families. I think it would be useful to add an annotation step that compares each de novo family (family HMM or MSA) against curated resources (Pfam, NCBIFAM, PANTHER/HAMAP).
  
  Could you briefly outline your expectations for how the pipeline handles transmembrane segments, coiled-coils, repeats, and IDRs, classes prone to over-splitting under MMseqs2 seeding and trimming due to short-motif signal, low complexity, and variable lengths?

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.08.12.670010v1
www.biorxiv.org www.biorxiv.org

Expression-Driven Genetic Dependency Reveals Targets for Precision Medicine

4
1. GigaScience 30 Jan 2026
  
  in GigaScience
  
  Abstract: Cancer cells are heterogeneous, each harboring distinct molecular aberrations and are dependent on different genes for their survival and proliferation. While successful targeted therapies have been developed based on driver DNA mutations, many patient tumors lack druggable mutations and have limited treatment options. Here, we hypothesize that new precision oncology targets may be identified through “expression-driven dependency”, whereby cancer cells with high expression of a targeted gene are more vulnerable to the knockout of that gene. We introduce a Bayesian approach, BEACON, to identify such targets by jointly analyzing global transcriptomic and proteomic profiles with genetic dependency data of cancer cell lines across 17 tissue lineages. BEACON identifies known druggable genes, e.g., BCL2, ERBB2, EGFR, ESR1, MYC, while revealing new targets confirmed by both mRNA- and protein-expression driven dependency. Notably, the identified genes show an overall 3.8-fold enrichment for approved drug targets and enrich for druggable oncology targets by 7 to 10-fold. We experimentally validate that the depletion of GRHL2, TP63, and PAX5 effectively reduces tumor cell growth and survival in their dependent cells. Overall, we present the catalog of expression-driven dependency targets as a resource for identifying novel therapeutic targets in precision oncology.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag011), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2
  
  The authors introduce BEACON, a Bayesian correlation approach designed to identify expression-driven dependency in cancer. Their hypothesis suggests that cancer cells with elevated expression of specific genes demonstrate increased vulnerability to the knockout of those same genes, thereby unveiling a promising new category of targets in precision oncology, which is particularly valuable for targeting cancer cells lacking druggable mutations. BEACON models expression levels and dependency scores as bivariate Gaussians and employs Markov Chain Monte Carlo (MCMC) sampling to estimate the correlation coefficient between them. They then compute p-values followed by rigorous multiple testing correction (BH-based FDR correction). A notable strength of their approach lies in the integration of mass spectrometry proteomics data alongside transcriptomic and perturbation screening data, enhancing the robustness of their findings. Their work highlights some key insights:

  - Gene expression-driven dependency (GED) candidates identified across lineages demonstrate enrichment for "DNA-binding transcription activator activity" and "DNA-binding transcription activator activity, RNA polymerase II-specific" pathways.
  - The analysis successfully identifies compelling candidates with robust signals in both GED and PED (FERMT2, GRHL2, KLF5, CDK6, and CCND1), which are well-supported by existing drug evidence or established literature.
  - Clustering analyses reveal that cancer cells from pancreas and biliary tract tissues, as well as kidney and urinary tract tissue lineages, exhibit remarkably similar expression-driven dependency profiles. Additionally, lineage-specific genes, such as transcription factors, cluster together in a manner consistent with existing literature.
  - Through Fisher's exact test, the authors demonstrate significant enrichments of druggable gene lists from DrugBank with expression-driven dependency patterns at both proteomic and transcriptomic levels.
  - Experimental validation shows that PAX5 is essential for PAX5-high B cell lymphoma cell growth, while TP63 and GRHL2 are essential for LSCC cell growth.
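
  The Benjamini-Hochberg (BH) correction mentioned in this summary can be sketched as follows. This is a generic illustration of the BH step-up adjustment, not BEACON's code; the function name `bh_adjust` and the example p-values are hypothetical.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four gene-level correlation tests
adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

  Genes whose adjusted value falls below the chosen FDR level (for example 0.05) would then be reported as significant.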
  
  However, I have several principal concerns about the study that should be addressed to demonstrate the robust and superior performance of this proposed approach.
  
  Major Comments:
  
  1. Quantitative benchmarking: While the authors present a valuable contribution, the concept of correlating gene dependency scores to expression has been explored previously through approaches like Project DRIVE (E. Robert McDonald, III et al.) and APSIC (Montazeri et al.). BEACON demonstrates strong correlations across multiple lineages, representing a broader scope compared to existing methods that appear more lineage-restricted. However, establishing BEACON's comparative advantages requires more rigorous evaluation. Notably, Project DRIVE, a foundational paper in this field, already identified several BEACON candidates in their "Expression Correlation Analysis Identifies Oncogenes and Lineage-Specific Transcription Factors" section, while APSIC characterized many lineage-specific discoveries as tumor effector genes. BEACON's strength lies in integrating proteomic data with transcriptomic and perturbation screens, enabling identification of additional candidates like PAX5 for hematopoietic and lymphoid tissue. To demonstrate the method's impact, I recommend systematic quantitative benchmarking against existing approaches.
  
  Importantly, BEACON utilizes richer/complementary datasets than previous studies. Disentangling contributions of data richness versus methodological innovation would provide valuable insights into whether enhanced performance stems from improved data availability or genuine method improvements.
  
  Overall for benchmarking, the authors are strongly encouraged to utilize any comprehensive datasets that best demonstrate their method's competitive advantage and are not limited to the specific comparisons recommended above.
  
  2. Correlation method comparisons: Figure S2 shows that BEACON exhibits higher MSE at extremes, and the claimed advantage over Pearson for small sample sizes is difficult to quantify from the current visualization. While the theoretical expectation that BEACON should outperform Pearson in small samples is reasonable, the practical significance remains unclear from these simulations. I recommend demonstrating BEACON's advantage using real data by creating a curated list of established GEDs/PEDs and comparing performance between the two methods. This is particularly important since several of BEACON's hits were previously reported by Project DRIVE using simple Pearson correlations. Alternatively, if BEACON's advantage is indeed significant, please elaborate on the simulation results to better justify this claim with clearer quantitative metrics.
  
  3. Validation experiments: I'm seeking clarification on the validation experiments for TP63 and GRHL2. These candidates were not sensitive to predicted dependency and the authors say that "pan-lineage targets may represent universal vulnerability and their inhibition may lead to undesired off-target effects on other cells". Are the authors positioning them as weaker candidates to illustrate the superiority of lineage-specific predictions like PAX5? Additionally, why were different experimental approaches used (CRISPR for PAX5 versus shRNA for TP63 and GRHL2)? For a method aimed at identifying druggable targets, would drug-based experiments be more relevant than knockdown approaches to better demonstrate clinical applicability?
  
  Minor comments
  
  In Figure 4A, the caption refers to the plot as a heatmap, but the visualization appears to be a scatterplot. Please clarify whether the heatmap is missing or modify the caption appropriately. Additionally, I recommend using a different shade of green, as the current color choice makes some gene names difficult to read.
  
  In Fig S5A, please add a legend for tumor and normal.
  
  For the TP63 and GRHL2 validation experiments, please include results for all four cell lines. The current manuscript is missing HCC15-shTP63, HCC15-shGRHL2, and HARA-shGRHL2 plots.
  
  How many replicates were performed for each experiment? Is it N = 3 for all experiments?
  
  Some text appears to be missing from this sentence: "BEACON offers the unique advantage of utilizing prior distributions that are less susceptible to outliers, especially in multiple lineages where the number of cell lines."
2. GigaScience 30 Jan 2026
  
  in GigaScience
  
  Abstract
  
  Cancer cells are heterogeneous, each harboring distinct molecular aberrations and are dependent on different genes for their survival and proliferation. While successful targeted therapies have been developed based on driver DNA mutations, many patient tumors lack druggable mutations and have limited treatment options. Here, we hypothesize that new precision oncology targets may be identified through “expression-driven dependency”, whereby cancer cells with high expression of a targeted gene are more vulnerable to the knockout of that gene. We introduce a Bayesian approach, BEACON, to identify such targets by jointly analyzing global transcriptomic and proteomic profiles with genetic dependency data of cancer cell lines across 17 tissue lineages. BEACON identifies known druggable genes, e.g., BCL2, ERBB2, EGFR, ESR1, MYC, while revealing new targets confirmed by both mRNA- and protein-expression driven dependency. Notably, the identified genes show an overall 3.8-fold enrichment for approved drug targets and enrich for druggable oncology targets by 7 to 10-fold. We experimentally validate that the depletion of GRHL2, TP63, and PAX5 effectively reduce tumor cell growth and survival in their dependent cells. Overall, we present the catalog of express-driven dependency targets as a resource for identifying novel therapeutic targets in precision oncology.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag011), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 4
  
  Reproducibility report for: Expression-Driven Genetic Dependency Reveals Targets for Precision Oncology
  Journal: Gigascience
  ID number/DOI: GIGA-D-25-00147
  Reviewer(s): Laura Caquelin, Department of Clinical Neuroscience, Karolinska Institutet, Sweden
  
  Summary of the Study
  
  The authors developed a Bayesian method called BEACON to integrate multi-omics data. The method was tested on cancer cell lines across 17 tissue types to identify expression-driven dependencies. The method recovered known drug targets and identified novel candidates. The study concludes this method provides a systematic approach to identify precision oncology targets.
  
  Scope of reproducibility
  
  According to our assessment the primary objective is: to identify expression-driven dependencies across cancer cell lines from multiple lineages enabling the discovery of genes whose expression levels correlate with cancer cell dependency scores.
  
  Outcome: Identification of genes with significant expression-driven dependencies across pan-lineage cancer cell lines.
  
  Analysis method outcome: "BEACON calculated the Bayesian correlation between the gene's expressions and CERES cancer dependency scores [25] across the pan-lineage cell lines. BEACON modeled expression levels and dependency scores as the bivariate Gaussians and used Markov Chain Monte Carlo (MCMC) sampling to estimate the correlation coefficient rho between them. Given the null hypothesis that the uncorrelated expression and dependency of a gene has the 0 rho coefficient, we statistically tested each gene's rho estimate obtained from the MCMC simulation as follows. Assume that the MCMC sampling is carried out for a null gene's expression and dependency, then we expect that the distribution of the rho estimate accumulated over the MCMC iterations will be centered at zero. Based on this rationale, we computed the z-score of i-th gene as the deviation of the MCMC estimate of rho from the expected (null) value (i.e., zero) in terms of the standard deviation observed in the simulated distribution, i.e., z(i) = rho_MCMC(i) / SD_MCMC(i). Since the z-values, by nature, follow a normal distribution with zero-mean and unit-variance, then we computed the p-value for each gene's rho estimate as the probability of observing a value as extreme as the computed z-value for that gene. We multi-testing corrected the resulting p-values using the BH procedure for FDR." (page 19 - Methods section / mRNA expression-driven dependency (GED))
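  The quoted procedure can be sketched in a few lines. This is a minimal illustration in Python rather than the authors' R code, and it rests on assumptions not stated in the quote: the function names `z_and_p` and `bh_adjust` are hypothetical, the p-value is taken as two-sided, and the standard deviation is the sample SD of the MCMC draws.

```python
import math
import statistics

def z_and_p(rho_samples):
    """z = mean of the MCMC draws of rho divided by their SD.

    The p-value is the two-sided standard-normal tail probability
    (two-sidedness is an assumption; the quoted text does not specify).
    """
    mean = statistics.fmean(rho_samples)
    sd = statistics.stdev(rho_samples)
    z = mean / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

  A gene would then be called significant when its adjusted p-value is below the chosen FDR level and its rho estimate passes the effect-size cutoff reported in the main result.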
  
  Main result: "We first analyzed the pan-lineage GED by using mRNA levels and the corresponding dependency scores from 854 cell lines with available data across 17 lineages and identified 244 genes showing significant association (correlation coefficient, rho < -0.25, FDR < 0.05)" (page 7 - Results section / Cancer vulnerability targets showing gene expression-driven dependency (GED))
  
  Availability of Materials
  
  a. Data
  
  - Data availability: Open
  - Data completeness: Complete, all data necessary to reproduce main results are available.
  - Access method: Repository
  - Repository: https://doi.org/10.6084/m9.figshare.19700056.v2
  - Data quality: Structured
  
  b. Code
  
  - Code availability: Open
  - Programming Language(s): R
  - Repository link: https://github.com/Huang-lab/BEACON
  - License: MIT license
  - Repository status: Public
  - Documentation: Readme file
  
  Computational environment of reproduction analysis
  
  Operating system for reproduction: MacOS 15.5
  
  Programming Language(s): R
  
  Code implementation approach: Using shared code
  
  Version environment for reproduction: R version 4.5.0/RStudio 2025.05.1
  
  Results
  
  5.1 Original study results
  
  Results 1: Supplementary table S2
  
  5.2 Steps for reproduction
  
  -> Run the code PanLineageMCMC.R
  
  Issue 1: File import paths and incorrect file name -- Resolved: In the original code, there were fixed file paths that only worked on one specific computer. This caused problems when running the code on other computers. To fix this, I recommended using relative paths, which are resolved from the working directory in which the script is run. This way, the code can be run on any computer without needing to change the paths each time.
  
  ------------------ Start of script ------------------
  sam.dep = read.csv(file.path(getwd(), "DepMap_data", "sample_info.csv"))
  ------------------- End of script -------------------
  
  Issue 2: Missing function "intsect" at line 162 -- Resolved: The script called a function intsect that was not defined, leading to an error. Upon request, the authors provided the missing function and added it to the main script (PanLineageMCMC.R).
  
  Issue 3: Output directory not created. -- Resolved: The script attempted to write output files to a directory that was not created beforehand. This caused errors during the loop execution when trying to save results. A directory check and automatic creation script was added. If the output folder does not exist, it is now created automatically before the loop runs.
  
  ------------------ Start of script ------------------
  dir_path <- paste0('../out/jags.nadapt',n.adapt,'.update',n.update,'.mcmc ',n.iter,'.simulation_SD_22Q2')
  if (!dir.exists(dir_path)) {
    dir.create(dir_path, recursive = TRUE)
  }
  ------------------- End of script -------------------
  
  5.3 Statistical comparison: original vs reproduced results
  
  - Results: Table.mRNA.dependency.Bayesian.pancancer file attached
  - Comments: The Bayesian PanCancer analysis was re-run, but only on the 244 significant genes listed in Supplementary Table S2, not on the full set of 17,285 genes. This choice was made due to limited computational resources, as running the full model would have required an estimated 100 hours.
  - Errors detected: -
  - Statistical consistency: Among the 244 significant genes originally reported, the reproduced analysis confirmed the statistical significance of these same genes. However, the exact numerical values (mean, standard deviation, Z value, P-value and adjusted P-value) differed slightly. These discrepancies are expected due to the nature of Bayesian inference, the absence of a random seed, and the relatively low number of MCMC iterations used (n.iter = 500). These settings may not be sufficient to ensure full convergence or reproducibility of posterior estimates, and the results should be interpreted with caution. We were unable to compare the rho values because they were neither available in the provided Supplementary Table S2 nor extracted by the R code for inclusion in the resulting output files.
  
  Conclusion
  
  Summary of the computational reproducibility review
  
  The results of Supplementary Table S2 in the original study were partially reproduced. We were able to confirm the statistical significance of the 244 genes reported in Supplementary Table S2 using the Bayesian PanCancer model in the provided code. However, the numerical results were not always identical. This is expected because Bayesian methods involve random sampling, the original code did not set a fixed random seed, and the number of iterations used was relatively low. Furthermore, the rho values were not available for comparison, limiting a full reproducibility assessment. Several technical issues were also fixed during the reproduction process, such as hardcoded file paths, a missing function, and the absence of output directories, which were resolved to allow the code to run correctly on a different system. Due to computational limitations, running the full model on all 17,285 genes was not performed.
  
  Recommendations for authors
  
  While the original analysis code was successfully used to confirm the statistical significance of the 244 genes, we recommend several improvements to enhance reproducibility:
  
  -- Code annotation: Adding more detailed comments within the scripts would help users understand the logic behind each step and the purpose of specific commands or operations.
  -- Set a random seed: Include set.seed() in all scripts to improve reproducibility across different runs.
  -- Specify R and package versions: Provide the R version and exact package versions needed to run the code, via a requirements file for example.
  -- Use relative file paths: Ensure that all necessary folders and functions are created or included by default to avoid path issues.
  -- Increase MCMC robustness: Use a higher number of iterations and appropriate parameter settings to ensure better convergence and stability of posterior estimates.
  -- Inform users about computation time: Clearly indicate in the README or publication the expected runtime of the code, especially if it requires several hours or days to complete.
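  The effect of the random-seed recommendation can be illustrated with a small sketch. It is written in Python as an analogy only, not as a port of the authors' R/JAGS pipeline; the function `monte_carlo_mean` and the seed values are hypothetical, and the 500 draws simply echo the n.iter = 500 setting discussed above.

```python
import random
import statistics

def monte_carlo_mean(n_draws, seed=None):
    """Average of n_draws standard-normal draws from a private generator."""
    rng = random.Random(seed)  # seed=None falls back to system entropy
    return statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(n_draws))

# Same seed, same draws: the two estimates are identical.
run_a = monte_carlo_mean(500, seed=2025)
run_b = monte_carlo_mean(500, seed=2025)

# A different seed gives a slightly different estimate at 500 draws,
# which is the kind of small numerical drift the report describes.
run_c = monte_carlo_mean(500, seed=1)
```

  With a fixed seed the comparison between original and reproduced tables becomes an exact check; without one, only agreement within Monte Carlo error can be expected.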
3. GigaScience 30 Jan 2026
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag011), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3
  
  The authors develop a method for correlating gene and protein expression with cellular dependencies using the resources of DepMap. The innovation appears to be a Bayesian approach to the correlation analysis. They use this approach to identify potential therapeutic targets and evaluate some top candidates using in vitro experiments. The paper is fairly straightforward to follow.
  
  Major comments:
  
  Benchmarking - given the non-linear relationships shown in Fig 2, is a comparison with the Pearson method the most appropriate? Would Spearman's correlation be better?
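  To make the Pearson versus Spearman question concrete, here is a minimal Python sketch on made-up data. The `expression` and `dependency` values are hypothetical and are not taken from Fig 2; they only stand in for a monotone but non-linear relationship.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: dependency becomes more negative exponentially
# as expression increases (monotone, but far from linear).
expression = [0, 1, 2, 3, 4, 5, 6, 7]
dependency = [-math.exp(0.8 * e) for e in expression]

r_pearson = pearson(expression, dependency)
r_spearman = spearman(expression, dependency)
```

  On data like these the rank-based coefficient reaches -1 because the relationship is perfectly monotone, while the Pearson coefficient is noticeably weaker in magnitude, which is the concern behind the question.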
  
  The analysis identifies dependencies that are proposed as therapeutic targets, however while the proteins can be druggable, what about normal tissue effects? Some of these are likely lineage-defining proteins that could be highly expressed in normal tissues. It is notable that in Fig 5B, C the existing drug targets have a lower association strength than other GEDs identified. Does this suggest that the strongest correlations might be lineage-crucial genes that are too important for normal tissue function to make good drug targets? This needs further consideration in the discussion. Are there any pathway differences between these groups (known drug targets vs others)? For example you might expect more tissue lineage TFs in the "other" category, while the approved drug targets are perhaps more often cell surface receptors.
  
  The cell assays performed should effectively be replicating the results of the dependencies on which BEACON is based (DepMap), so why do you get different results? Is it because of the different methods used, i.e. shRNA (not seeing the correlation between expression and dependency) vs CRISPR (replicating the correlation)? If you look at older DepMap scores, from when they used knockdown rather than CRISPR, can you replicate your results?
  
  Although mycoplasma testing was done, were the cell lines re-authenticated by STR profiling at any point?
  
  qPCR is mentioned in the methods but not provided in the results that I can find. Did this validate gene knockdown by shRNA? Any correlation between % KD and proliferation/colony forming effect?
  
  In the discussion it should be acknowledged that cancer subtypes exist within lineages that are molecularly and clinically distinct and so the method might be missing targets specific for these eg ER+ and ER- breast cancer.
  
  Minor comments:
  
  1. Results para 1 "especially in multiple lineages where the number of cell lines." Missing something in this sentence?
  2. Needs some grammar review.
  3. Please italicise all gene names (when referring to gene, not protein) eg CCNE1 amplification etc.
  4. Fig S5A - legend or axis labels for N and T needed.
  5. Fig S5C, D - these are proliferation assays, not colony forming assays as stated in the text.
  6. Please include number of replicates and type of error bars in figure legends for cell assays.
4. GigaScience 30 Jan 2026
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag011), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1
  
  The authors present BEACON, a method for identifying associations between the expression of a gene and sensitivity to the CRISPR knockout of that gene across a panel of cancer cell lines. These 'oncogene like' dependencies represent potential therapeutic targets that might be exploited for the development of new precision medicines in cancer. The issue that BEACON aims to address is the limited sample size (cell line count) in some specific cancer lineages and experimental noise that might result in spurious correlations between expression and CRISPR sensitivity. The authors demonstrate, using a modelling approach, that BEACON is more reliable for estimating correlation than simple Pearson's correlation when there is high-noise in the measurements. The majority of the manuscript focuses on analyses of dependencies systematically identified using the BEACON approach and their enrichment in drug targets and biological pathways. There is some experimental testing of three potential expression driven dependencies presented. The rationale for the overall approach and analyses are clear.
  
  Major comments
  
  Previous efforts have systematically associated gene/protein expression with CRISPR sensitivity across the same or related datasets (e.g. Pacini et al, Cancer Cell 2024 and Rohde et al, Molecular Systems Biology 2025 using CRISPR; McDonald et al, Cell 2017 and Tsherniak et al, Cell 2017 using RNAi) and so the primary contribution of this paper can be considered the development of the BEACON method. It is thus somewhat surprising that there is no real assessment of the improvements offered by BEACON when compared to simpler methods (Pearson correlation, Spearman correlation) or more complex recent approaches (Rohde et al's BACON approach). The modelling approach suggests some improvements in specific circumstances (especially high noise) but it is not clear that this leads to improved dependency identification in the real data. Does BEACON identify known oncogene addictions better than these methods? Are the associations identified more reproducible (e.g. across alternative CRISPR screens or RNAi screens)?
  
  The experimental validation and the conclusions drawn from it are somewhat confusing. The authors assess three potential expression associated dependencies - two pan-cancer dependencies (GRHL2 and TP63) and one lineage specific dependency (PAX5 in myeloid cells). Only the lineage-specific dependency validated in the way that might be expected, with higher expression associated with increased dependency, leading the authors to conclude that lineage-specific dependencies may be more suitable targets than pan-cancer ones. Given the numbers analysed (3 genes) this suggestion is not well supported. Moreover the perturbation was performed using distinct approaches - CRISPR for PAX5 and shRNA for the other two genes - and only the knockdown of PAX5 was validated by Western blot. It is very hard to know what phenotypes might be false positives from off-target shRNA effects or false negatives from variable shRNA knockdown of the target. The results in S5C suggest that the two shRNAs for each gene cause somewhat discordant phenotypes, suggesting there may be some issues with knockdown efficiency. This could potentially be addressed by adding additional shRNAs for GRHL2 / TP63 or testing them using CRISPR perturbation as was done for PAX5. Validation of the knockdown of the intended target could also shed some light here. The manuscript also mentions experiments in an additional cell line (HCC15) but I cannot see these results presented in the main figures or supplement. It would be useful if all results for these two genes were presented in a single figure, with high and low expressing cell lines clearly marked.
  
  Minor:
  
  Previous work has established that in some cases lower expression of a gene can make cells more vulnerable to its perturbation (CYCLOPS genes, Nijhawan et al, Cell 2012). While these are not the focus of this manuscript, it would be useful for the authors to comment on the utility of BEACON for their identification.
  
  p14 "Moreover, GED/PED targets were depleted of genes that were Essential In Culture" - it's not clear what this means or where the data comes from. By definition the gene set analysed is at least somewhat essential in culture.
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.10.17.618926v1
www.biorxiv.org www.biorxiv.org

Federated Knowledge Retrieval Elevates Large Language Model Performance on Biomedical Benchmarks

2
1. GigaScience 29 Jan 2026
  
  in GigaScience
  
  Abstract
  
  Background: Large language models (LLMs) have significantly advanced natural language processing in biomedical research, however, their reliance on implicit, statistical representations often results in factual inaccuracies or hallucinations, posing significant concerns in high-stakes biomedical contexts.
  
  Results: To overcome these limitations, we developed BTE-RAG, a retrieval-augmented generation framework that integrates the reasoning capabilities of advanced language models with explicit mechanistic evidence sourced from BioThings Explorer, an API federation of more than sixty authoritative biomedical knowledge sources. We systematically evaluated BTE-RAG in comparison to traditional LLM-only methods across three benchmark datasets that we created from DrugMechDB. These datasets specifically targeted gene-centric mechanisms (798 questions), metabolite effects (201 questions), and drug–biological process relationships (842 questions). On the gene-centric task, BTE-RAG increased accuracy from 51% to 75.8% for GPT-4o mini and from 69.8% to 78.6% for GPT-4o. In metabolite-focused questions, the proportion of responses with cosine similarity scores of at least 0.90 rose by 82% for GPT-4o mini and 77% for GPT-4o. While overall accuracy was consistent in the drug–biological process benchmark, the retrieval method enhanced response concordance, producing a greater than 10% increase in high-agreement answers (from 129 to 144) using GPT-4o.
  
  Conclusion: Federated knowledge retrieval provides transparent improvements in accuracy for large language models, establishing BTE-RAG as a valuable and practical tool for mechanistic exploration and translational biomedical research.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag007), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Sajib Acharjee Dip
  
  This paper introduces BTE-RAG, a system that combines large language models with biomedical knowledge from BioThings Explorer. Tested on three benchmarks built from DrugMechDB (genes, metabolites, and drug-process links), it shows clear accuracy gains compared to using LLMs alone.
  
  Strengths: The work demonstrates that retrieval improves both small and large models, suggesting cost-efficiency and scalability. The paper also curates multi-scale QA datasets (gene, metabolite, drug) from DrugMechDB that provide structured, reproducible evaluation.
  
  Weaknesses:
  
  1. This dual-route design is conceptually sound but too narrow a baseline. A stronger evaluation would compare against other RAG systems (PubMed-based retrieval, BiomedRAG, SPOKE-RAG) instead of just "LLM-only."
  2. For the entity recognition step, using pre-annotated entities in benchmarks artificially simplifies the problem. In real-world biomedical QA, entity recognition itself is a major challenge (e.g., ambiguous drug synonyms, rare disease names). Besides, the zero-shot extraction module is described but not evaluated. The paper should report precision/recall of entity recognition to show feasibility beyond curated inputs.
  3. No error analysis of BTE retrieval quality is provided. If BTE returns wrong or noisy triples, how often does this mislead the LLM? Adding an experiment to show that would strengthen the study.
  4. Although the authors used SOTA LLMs, the choice of only the OpenAI GPT-4o family is narrow. There is no comparison with open-source biomedical LLMs (e.g., BioGPT, Meditron, PubMedBERT-RAG). Comparison with these models would increase generalizability.
  5. Reliance on one source (DrugMechDB) makes evaluation narrow. The authors should demonstrate performance on at least one independent dataset (e.g., BioASQ, PubMedQA, SPOKE-based tasks) to show broader utility.
  6. Cosine similarity ≥0.9 is arbitrary; the authors should provide ROC/AUC or threshold sensitivity.
  7. Benchmarks enforce exactly one correct gene, metabolite, or drug per question. Real mechanisms often involve multiple parallel or interacting entities. The single-answer design hides biological complexity and creates an artificial task.
  8. Ground truth relies on exact HGNC, CHEBI, or DrugBank IDs. Why are the ambiguities (synonyms, deprecated IDs, overlapping terms) filtered out rather than addressed? This may bias the dataset toward easier, cleaner cases.
  9. The paper cited recent biomedical RAG systems such as BiomedRAG and GeneTuring but didn't compare with them (e.g., BiomedRAG). BioRAG (2024) is also highly relevant. These works are highly relevant baselines, showing retrieval from knowledge graphs, APIs, or literature, and including them in comparison would better position BTE-RAG within the current state of the art and highlight its unique contributions.
2. GigaScience 29 Jan 2026
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag007), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Christopher Tabone
  
  Dear Authors,
  
  Thank you for the opportunity to review "Federated Knowledge Retrieval Elevates Large Language Model Performance on Biomedical Benchmarks." The paper tackles a timely and important problem: grounding large language models in mechanistic evidence to reduce unsupported claims. It does so with a thoughtful design that layers BTE-RAG over a federation of approximately 60 biomedical APIs and evaluates three complementary DrugMechDB-derived benchmarks (gene, metabolite, drug to process). The manuscript is clearly written, the technical contribution is meaningful, and the experimental results are promising.
  
  Recommendation: Major revision.
  
  Below are concrete, actionable changes that would bring the work in line with GigaScience's standards for FAIR availability, licensing, documentation, testing, and reproducibility. Many are straightforward, but together they matter for long-term reuse and auditability.
  
  1) Statistical rigor: paired inference, uncertainty, variance
  
  The manuscript reports compelling descriptive gains. Because each benchmark item is answered under both conditions (LLM-only and BTE-RAG), the study is a paired design. In paired settings, descriptive plots and point estimates are not sufficient to establish that improvements exceed sampling noise or threshold tuning. Please add paired statistical evidence that quantifies: (i) whether the gains are reliable, (ii) how large they are in practical terms, and (iii) how stable they are under repeated runs or under a fully deterministic pipeline.
  
  Gene task (binary): Report McNemar's test on the existing 2×2 tables, along with 95 percent Wilson confidence intervals for each condition and a Newcombe confidence interval for the accuracy difference. Keep the flip counts in the text.
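  For the gene task, the requested statistics could look like the minimal Python sketch below. The function names and all counts are hypothetical and are not taken from the manuscript's 2×2 tables; the exact binomial form of McNemar's test is one common choice, and the Newcombe interval for the difference is left out for brevity.

```python
import math

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b: items correct only under the baseline.
    c: items correct only with retrieval.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical counts for illustration only.
p_value = mcnemar_exact(b=12, c=40)
baseline_ci = wilson_interval(successes=60, n=100)
```

  Only the discordant pairs enter McNemar's test, which is why keeping the flip counts in the text matters.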
  
  Metabolite and drug-to-process tasks (similarity): Report paired bootstrap confidence intervals or Wilcoxon signed-rank tests on per-item similarity differences (BTE-RAG minus baseline). Include a nonparametric effect size such as Cliff's delta with its confidence interval.
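A minimal sketch of the two quantities requested for the similarity tasks, assuming per-item similarity scores are available as two equal-length lists in the same item order. The function names, the number of resamples, and the seed are illustrative assumptions; Cliff's delta is computed with its standard two-sample definition, and its confidence interval is not shown.

```python
import random
from statistics import mean


def paired_bootstrap_ci(baseline, treated, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-item difference (treated - baseline)."""
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [t - b for b, t in zip(baseline, treated)]
    rng = random.Random(seed)  # fixed seed so the interval is repeatable
    n = len(diffs)
    # Resample items (not conditions) so the pairing is preserved.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return mean(diffs), lo, hi


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs, in [-1, 1]."""
    if not xs or not ys:
        raise ValueError("inputs must be non-empty")
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))
```

Calling `cliffs_delta(treated, baseline)` gives a positive value when the treated scores tend to be higher.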
  
  Threshold validation: Treat the greater-than-or-equal-to 0.90 "high-fidelity" threshold as a choice that should be validated. Show sensitivity across nearby cutoffs such as 0.85, 0.90, and 0.95, and add a small blinded expert adjudication (about 50 to 100 items) to confirm that the high-cosine band corresponds to acceptable correctness.
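The sensitivity check across cutoffs can be sketched as follows, assuming per-item cosine similarities for both conditions are available as lists. The function names are hypothetical, and the default cutoffs are simply the three values named above.

```python
def fraction_at_or_above(scores, threshold):
    """Share of items whose similarity meets or exceeds the threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(s >= threshold for s in scores) / len(scores)


def threshold_sensitivity(baseline, treated, cutoffs=(0.85, 0.90, 0.95)):
    """Map each cutoff to (baseline fraction, treated fraction) above it."""
    return {
        c: (fraction_at_or_above(baseline, c), fraction_at_or_above(treated, c))
        for c in cutoffs
    }
```

If the ordering of the two conditions holds at every cutoff, the conclusion does not hinge on the specific choice of 0.90.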
  
  Variance or determinism: Either document end-to-end determinism (frozen retrieval caches, fixed ordering, pinned embeddings) or run at least three replicates and report mean and standard deviation.
  
  These additions convert the current descriptive story into paired inference with uncertainty and effect sizes and clarify robustness around thresholding and reproducibility.
  
  2) Benchmark scope and generalizability

  All three evaluations are derived from DrugMechDB, which makes the study internally consistent but also couples the tasks to a single curation philosophy and evidence distribution. Please acknowledge this limitation explicitly in the Discussion and, ideally, add an external validation on at least one independent source to demonstrate generalizability. Options include CTD (drug-gene-process links), Reactome or GO (pathway and process grounding), DisGeNET (gene-disease associations), or a lightweight question answering set sourced outside DrugMechDB. Even a modest external set of about 100 to 200 items, evaluated with the same paired protocols and identifier-based scoring, would strengthen the claim. If full external validation is not feasible for this revision, please include robustness checks such as a date-based split, entity-family holdouts, and per-source ablations.
  
  3) Licensing, attribution, and persistent identifiers

  The project is MIT-licensed and adapts components from BaranziniLab/KG_RAG (Apache-2.0) and SuLab/DrugMechDB (CC0-1.0). To meet license obligations and align with FAIR and the Joint Declaration of Data Citation Principles, please: (i) keep Apache-licensed code under Apache with the upstream LICENSE and NOTICE files, noting any modifications; (ii) include the CC0 dedication text for any DrugMechDB artifacts and note that CC0 provides no patent grant; (iii) archive with DOIs (GigaDB preferred?) the three benchmarks, the exact evaluation caches used in the paper, and a tagged software release of the repository; (iv) license datasets under CC0 or CC BY while keeping the code MIT; (v) add a short Data and Software Availability table listing artifact, DOI or URL, license, and version or date.
  
  4) Error analysis and degradation cases

  Please add a brief failure analysis focused on where BTE-RAG reduces accuracy relative to LLM-only. At minimum, report the total number and percent of right-to-wrong flips per task and include a small set of representative cases. For each example, show the input, expected and predicted outputs, the top retrieved evidence with identifiers and timestamps, and a one-line diagnosis of the likely cause (for example normalization mismatch, retrieval coverage gap, ranking or filtering that hid relevant context, or long-context truncation). A short summary that groups the main causes into two or three buckets will make the results more interpretable and point to practical fixes.
  
  5) Methodological transparency: embedding and scoring models

  Please add two or three sentences in Methods explaining why S-PubMedBERT-MS-MARCO is used for filtering retrieved context while a BioBERT-based model is used for semantic similarity scoring, and what advantages each provides over plausible alternatives. A brief rationale will strengthen methodological transparency.
  
  6) Reproducibility workflow and archived caches

  Because BTE federates live APIs, results can drift as sources update. Please archive the exact retrieval caches used in evaluation with DOIs and minimal provenance if at all possible (query identifier, subject and object identifiers, predicate, source name and version or access date, any confidence score, and a retrieval timestamp).
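One way to picture the minimal provenance listed above is a flat record serialized as JSON Lines. This is only a sketch: the class name, field names, and file layout are assumptions made for illustration and are not the BTE-RAG cache format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheRecord:
    """One retrieved edge with the provenance fields named in the review (hypothetical schema)."""
    query_id: str
    subject_id: str
    object_id: str
    predicate: str
    source_name: str
    source_version: str          # version string or access date of the source
    retrieved_at: str            # retrieval timestamp, e.g. ISO 8601
    confidence: Optional[float] = None  # only when the source reports one


def to_jsonl(records) -> str:
    """Serialize records one per line with sorted keys for stable diffs."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)
```

Sorted keys and one record per line keep the archived file diffable and easy to checksum before depositing it under a DOI.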
  
  In summary, this is a promising and well-motivated study that could make a useful contribution once the statistical evidence, FAIR availability, and reproducibility pieces are tightened as outlined above. I recommend Major Revision and am happy to re-review a revised version.
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.08.01.668022v1
www.biorxiv.org www.biorxiv.org

SpaceBF: Spatial coexpression analysis using Bayesian Fused approaches in spatial omics datasets

2
1. GigaScience 29 Jan 2026
  
  in GigaScience
  
  Abstract: Advances in spatial omics enable measurement of genes (spatial transcriptomics) and peptides, lipids, or N-glycans (mass spectrometry imaging) across thousands of locations within a tissue. While detecting spatially variable molecules is a well-studied problem, robust methods for identifying spatially varying co-expression between molecule pairs remain limited. We introduce SpaceBF, a Bayesian fused modeling framework that estimates co-expression at both local (location-specific) and global (tissue-wide) levels. SpaceBF enforces spatial smoothness via a fused horseshoe prior on the edges of a predefined spatial adjacency graph, allowing large, edge-specific differences to escape shrinkage while preserving overall structure. In extensive simulations, SpaceBF achieves higher specificity and power than commonly used methods that leverage geospatial metrics, including bivariate Moran’s I and Lee’s L. We also benchmark the proposed prior against standard alternatives, such as intrinsic conditional autoregressive (ICAR) and Matérn priors. Applied to spatial transcriptomics and proteomics datasets, SpaceBF reveals cancer-relevant molecular interactions and patterns of cell–cell communication (e.g., ligand–receptor signaling), demonstrating its utility for principled, uncertainty-aware co-expression analysis of spatial omics data.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag006), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Daniel Domovic
  
  Dear authors,
  
  I read your manuscript "SpaceBF: Spatial coexpression analysis using Bayesian Fused approaches in spatial omics datasets" with interest.
  
  The manuscript presents SpaceBF, a Bayesian method for detecting spatial co-expression between pairs of molecules in spatial omics data. The topic is relevant because new technologies such as spatial transcriptomics, mass spectrometry imaging, and multiplex immunofluorescence produce large datasets, while current tools for co-expression analysis are limited. The authors try to fill this gap with a new model, which they also test on real datasets. The paper is technical, but it also gives biological examples, which is helpful for readers.
  
  The paper has many strong points. First, the idea of using a Bayesian fused horseshoe prior together with an MST spatial structure is new and well explained. Second, the authors apply their method to three real datasets and show interesting biology, for example the IGF2-IGF1R relation, keratin isoform consistency, and stromal ECM peptides. Third, I appreciate that the code is open on GitHub. Also, the paper compares SpaceBF with other methods and avoids the common problem of variance-stabilizing transforms by modeling UMI counts directly with a negative binomial distribution.
  
  Overall, the work is clear and well organized, but there are some points where more explanation or clarification would help. In my review I give major and minor remarks that I hope will improve the paper.
  
  Major remarks

  1. Were you worried choosing MST may oversimplify spatial relationships, since many meaningful local neighborhoods may be excluded? Would the results of SpaceBF be significantly different if a different spatial graph, such as kNN, Delaunay triangulation, or kernel-based, was used instead of MST?
  2. Since MST edges depend a lot on pairwise L2 distances, how stable are the results if spatial coordinates are a little noisy, or if there are tissue registration errors?
  3. The model puts one molecule as outcome and the other as predictor. Are the co-expression estimates still the same if you switch roles?
  4. In the Results you mention "FDR < 0.1." Can you explain which method you used for FDR? Also, are the discoveries robust if you change the threshold (for example 0.05 vs 0.1)?
  5. Do the simulation parameters (lengthscale, slope, dispersion) correspond to realistic biological signal strengths and spatial scales observed in real datasets? Three values of the lengthscale l are considered, l = 3.6, 7.2, 18. Why exactly these values? What does ν=0.75 mean in terms of effect size? How does l=18 compare to real tissue lengthscales?
  6. Can you describe runtime and memory for larger datasets, like 10X Visium with 5,000-20,000 spots? Is the current MCMC practical for this scale, or do you think approximate inference (like variational Bayes or INLA) is needed?
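To make the FDR question in remark 4 concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment and of counting discoveries at two thresholds. The manuscript's actual FDR procedure is not stated in this review, so Benjamini-Hochberg is only an assumed example; the function names are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def discoveries(pvalues, fdr):
    """Number of tests whose adjusted p-value falls below the FDR level."""
    return sum(q < fdr for q in benjamini_hochberg(pvalues))
```

Comparing `discoveries(p, 0.05)` with `discoveries(p, 0.1)` on the same p-values is one simple way to report how threshold-dependent the findings are.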
  
  Minor remarks

  1. How sensitive are the results to the choice of hyperparameters for the Horseshoe prior?
  2. In the Results you state that keratins "co-express highly, meaning their binding patterns with any specific type 1 keratin should be similar." Please make clear that SpaceBF measures co-expression, not direct binding, so that conclusions are not overstated.
  3. You mention SpatialCorr and Copulacci, but the comparison was not successful. Even if parameters were sensitive, I think one short numerical comparison in the supplement would be helpful.
  4. You filter out genes with fewer than ~59 total reads (0.2 x number of spots). Can you justify the choice of this threshold and show if results are stable for other thresholds (for example 0.1x or 0.5x)? Since many ligands and receptors are lowly expressed, is there a risk of losing meaningful biology? Since the dataset has only 293 spots, thresholds can have a strong effect.
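The filtering threshold in the last minor remark is simple arithmetic: 0.2 × 293 spots = 58.6, which rounds to the quoted ~59 reads. A small sketch of how the retained gene set could be compared across multipliers follows; the function name and the input dictionary of per-gene read totals are hypothetical.

```python
def genes_retained(total_reads_per_gene, n_spots, factor):
    """Genes whose total read count is at least factor * n_spots, sorted by name."""
    cutoff = factor * n_spots
    return sorted(g for g, total in total_reads_per_gene.items() if total >= cutoff)


def retention_by_factor(total_reads_per_gene, n_spots, factors=(0.1, 0.2, 0.5)):
    """Map each multiplier to the list of genes that survive the filter."""
    return {f: genes_retained(total_reads_per_gene, n_spots, f) for f in factors}
```

Listing which ligands and receptors drop out between 0.1x and 0.5x would directly address the concern about losing lowly expressed genes.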
2. GigaScience 29 Jan 2026
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag006), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Satwik Acharyya
  
  Summary: The manuscript introduces a novel statistical framework for analyzing spatially varying molecular co-expression. Leveraging a Bayesian fused modeling approach, SpaceBF estimates both local (location-specific) and global (tissue-wide) co-expression patterns, particularly useful for studying cell-cell communication via ligand-receptor interactions. The method outperforms traditional geospatial metrics like bivariate Moran's I and Lee's L in terms of specificity and precision. Application of SpaceBF to spatial omics data reveals new insights into molecular interactions across various cancer types, offering a powerful tool for spatial omics research. The paper is nicely written and well structured, with great visualizations, but I have the following comments.
  
  The authors missed a couple of key references related to co-expression analysis of spatial omics data, such as JOBS (Chakrabarti et al., 2024) and SpaceX (Acharyya et al., 2022). I recommend that the authors include these references in the Introduction section.
  
  A figure illustrating the method could be included.
  
  In Melanoma ST data analysis, authors have used the RCTD algorithm (Cable et al., 2022) for cell-type estimation. It seems like the gene expression matrix has been used twice in the whole process: once in case of cell-type estimation and co-expression analysis afterwards. The obtained results can be highly correlated due to multiple uses of the gene expression matrix. It would be great if authors can address this issue.
  
  In the cSCC ST data analysis, BayesSpace (Zhao et al., 2021) algorithm has been used for spatial region identification. In Figure 2C, cluster numbers are provided only and those are not transferred to spatial regions. It is difficult to make spatial region specific inference without such regional annotation of clusters. The gene expression matrix is used multiple times in this case as well (spatial region identification and co-expression analysis).
  
  Spatial omics datasets are sparse in nature. It is possible that some of these edges may not exist if the molecules are far apart. The authors are requested to justify the use of a shrinkage prior such as the horseshoe rather than a spike-and-slab prior.
  
  While the authors briefly mention the associated computational costs, it is recommended to include a comparison of the computational costs of the different approaches in the simulation studies. This would provide a more comprehensive understanding of the proposed method's efficiency and feasibility. It would also be interesting to see how the method scales to large datasets.
  
  To ensure the robustness of the proposed methodology, it is requested that the authors include a detailed sensitivity analysis for the selected priors and parameters.
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.03.29.646124v11
www.biorxiv.org www.biorxiv.org

Comparative evaluation of computational methods for reconstruction of human viral genomes

3
1. GigaScience 05 Jan 2026
  
  in GigaScience
  
  Abstract: The increasing availability of viral sequences has led to the emergence of many optimized viral genome reconstruction tools. Given that the number of new tools is steadily increasing, it is complex to identify functional and optimized tools that offer an equilibrium between accuracy and computational resources as well as the features that each tool provides. In this paper, we surveyed open-source computational tools (including pipelines) used for human viral genome reconstruction, identifying specific characteristics, features, similarities, and dissimilarities between these tools. For quantitative comparison, we create an open-source reconstruction benchmark based on viral data. The benchmark was executed using both synthetic and real datasets. With the former, we evaluated the effects to the reconstruction process of using different human viruses with simulated mutation rates, contamination and mitochondrial DNA inclusion, and various coverage depths. Each reconstruction program was also evaluated using real datasets, demonstrating their performance in real-life scenarios. The evaluation measures include the identity, a Normalized Compression Semi-Distance, and the Normalized Relative Compression between the genomes before and after reconstruction, as well as metrics regarding the length of the genomes reconstructed, computational time and resources spent by each tool. The benchmark is fully reproducible and freely available at https://github.com/viromelab/HVRS.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf159), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3: Serghei Mangul
  
  The paper is well written and provides a valuable contribution to the field. My only concern pertains to the real data utilized, which lacks a gold standard. Consequently, I question whether the real data adds significant value to the analysis.

  Major comments:

  1. What are the types of data used in the manuscript? Is it solely metagenomics data? If so, it would be beneficial to clarify this in the abstract and potentially in the title.
  2. Was the real data comprised of metagenomics? It would be advantageous to include some text explaining the nature of the data.
  3. In the section titled "Performance in real datasets," it is unclear why the results of FALCON-meta are regarded as the gold standard.

  Minor comments:

  1. The phrase "availability of viral sequences" seemingly suggests that the author intends to reference viral sequencing data or metagenomics data. Currently, it reads as though it refers to viral reference genomes.
2. GigaScience 05 Jan 2026
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf159), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Anton Korobeynikov
  
  Sousa et al., in their article, attempt to review available computational methods for assembling human viruses from real and simulated data. While the review itself seems to be valuable, we believe that the exposition contains several methodological issues that render it only somewhat useful. We will try to summarize these high-level issues instead of going through small details here and there.
  
  To our surprise, the authors (who are also authors of two tools under evaluation) somehow do not distinguish between different kinds of input data, which is pretty important as it effectively determines the choice of the tool. Clearly, there is no silver bullet here and there is no single push-button solution that could universally handle all kinds of input data. It is very strange that, for example, the authors do not distinguish between DNA and RNA viruses. The sequencing approaches for these kinds of data are very different, the genomes have entirely different internal structure and organization, and the challenges associated with the assembly process are different. This is summarized well in e.g. (Grabherr et al, 2011), (Bushmanova et al, 2019) and (Meleshko et al, 2022), among others. To add a second dimension here: we can have a more or less "pure" viral culture, or a metagenome / metavirome, or some highly divergent metavirome (e.g. in the case of HIV or other viruses that undergo reverse transcription). The host contamination is more sound for DNA viruses, etc. So, to summarize: all (very complex!) variations of input data were somehow folded into a single "human viruses" title, which is really misleading. It is the properties of the input data that should guide the choice of the appropriate tool.
  
  Next, the choice of tools is also somewhat questionable. Some well-known tools like PRICE or VICUNA were omitted. Ok, IVA is here and this might be enough for "classical" viral assemblies. But then the general-purpose metagenome assembler metaSPAdes is considered without alternatives. How about MEGAHIT? For RNA viral data, what about Trinity or rnaSPAdes? It was strange to see coronaSPAdes mentioned, while it is essentially rnaviralSPAdes plus a set of SARS-CoV-2 HMMs. Why not just rnaviralSPAdes if we already know we're not going to reconstruct coronaviral data? Another thing is that the majority of tools are tuned for particular tasks: there are tools for quasispecies assembly, which aim to preserve all the variation present. Metagenomic assemblers aim to provide a backbone consensus of a metagenome. Assemblers for RNA data usually aim to reconstruct as many transcripts as possible (so their "duplication rate" might be misleading). metaviralSPAdes aims to reconstruct full-length circular and linear viruses from complex contaminated metagenomes, so it could be very conservative, etc. It feels like the benchmarking compares something warm with something soft, giving misleading guidance to the reader.
  
  Finally, it is the year 2025, but the pipeline is just a huge pile of shell scripts that install tools (sometimes outdated as far as I can see, e.g. it uses SPAdes 3.13.0, which was released more than 5 years ago), often globally, sometimes only via conda. It could hardly be called a "reproducible" pipeline: error handling is almost non-existent, and if something fails midway the user might end up in a partially resolved state. There are lots of frameworks and approaches developed recently that provide all the necessary features, such as job isolation, installation, restart and checkpointing, data acquisition, etc. To put it simply: why is everything done manually via hand-written shell scripts and not based on, say, Nextflow? There are lots of ready-made modules from nf-core that one could just reuse. Some ideas could likely be taken from https://github.com/nf-core/viralrecon/ and other pipelines available there.
3. GigaScience 05 Jan 2026
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf159), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Levente Laczkó
  
  I reviewed the manuscript titled "An evaluation of computational methods for reconstruction of human viral genomes" by Sousa et al. The authors reviewed different tools for the reconstruction of viral genomes and developed a benchmarking framework to measure the performance of the different tools. The benchmarking was performed with both synthetic and real sequencing data, and the authors provide recommendations for different scenarios. The benchmarking framework developed with Bash is also made available on GitHub, providing the scientific community a good example to increase reproducibility. The analysis steps are also clearly described in the manuscript. Independent benchmarks, such as presented in the manuscript, are valuable contributions to the scientific literature and help to select the right tool for different tasks. The manuscript is clearly structured and well written, and the results are appropriately presented with rich supplementary material. I definitely recommend the publication of the manuscript in GigaScience. However, I have some questions that I think should be addressed before publishing the final version to further improve the manuscript.
  
  The authors describe that multiple strains may be present within a single infection. Indeed, the variability of strains within a single infection may be particularly important for some viruses. QuRe, ViSpA, SAVAGE and ViQUF are explicitly designed to find quasispecies. Are there any other tools in the benchmark that can predict whether samples are heterogeneous (or whose results can be used to infer this)?
  
  The authors have used the human mitochondrion as a source of contamination to test whether the tools are sensitive to it. Is there a reason why only the mitochondrion was used for this test and other, perhaps random, human DNA fragments were not?
  
  The error rate can strongly influence the accuracy of reference-based genome reconstructions. Has the effect of error rate been tested or could it affect the results, e.g. are there any tools in the benchmark that are less sensitive to higher error rates?
  
  In the synthetic dataset, the coverage ranged from 2-40×. This range represents scenarios where the viral copy number is low, but especially if the viral DNA was enriched before sequencing, the coverage could be much higher. Is there a reason to specifically choose 40x coverage as the highest coverage value? I agree that low coverage is a difficult challenge, but checking the performance of different tools at high read depth can help readers to choose the right tool for these use cases if there is a difference in the performance of the tools at e.g. >100x coverage.
  
  The authors correctly describe that the complexity of genomes can be a challenge for accurate genome reconstruction. Assessing the complexity (e.g. repetitive content ratio, GC ratio) of the genomes used in the synthetic dataset can add additional value to the results by showing how different tools perform on genomes of different complexity.
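As a rough illustration of the complexity measures mentioned above, the sketch below computes a GC ratio and a crude repeat proxy (the share of k-mer positions whose k-mer occurs more than once). Both function names and the choice of k are assumptions for illustration; a dedicated repeat annotation tool would be the more rigorous option.

```python
from collections import Counter


def gc_ratio(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def repeated_kmer_fraction(seq: str, k: int = 4) -> float:
    """Share of k-mer positions whose k-mer appears more than once in the sequence."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    return sum(c for c in counts.values() if c > 1) / len(kmers)
```

For real viral genomes a larger k would be needed so that chance repeats do not dominate the proxy.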
  
  Some reference-based tools (QVG, TRACESPipe, TRACESPipeLite and V-pipe) produced results with many gaps. Could differences in how these tools handle low-coverage regions be the reason? QVG, for example, masks positions with low sequencing depth to increase the specificity of the search for polymorphisms. Can the gaps be explained by the variation in sequencing depth, i.e. could the gaps be linked to genomic regions with very low or very high sequencing depth?
  
  I agree that benchmarking real datasets without the correct original sequence is a difficult task. I believe that showing the coverage and completeness (e.g. the ratio of the reconstructed length to the length of the reference genome) can be additional useful information for the reader to choose the right tool for different tasks. The expected length of the viral genomes could be determined by the length of the reference genomes used, based on the classification of FALCON-meta, and in the case of de novo pipelines, the scaffolds that do not match the references could be classified using e.g. kraken2. This could show how complete the reconstructed genomes are and whether there are other viral genomes in the samples that FALCON-meta missed but still represent valuable information. Supplementary Figures S143-S146 show the number of reconstructed bases with and without gaps, but I think that this experiment should be emphasised more in the main text and that the ratio of reconstructed bases to the expected genome sizes might be more informative than just the total number of reconstructed base pairs.
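The suggested ratio can be sketched as follows, assuming the reconstructed sequence is available as a string and the expected length is taken from the matching reference genome. The function name is hypothetical, and treating every non-ACGT character as a gap is a simplifying assumption.

```python
def completeness(reconstructed: str, expected_length: int) -> tuple[float, float]:
    """Return (ratio with gaps, ratio without gaps) relative to the expected genome length."""
    if expected_length <= 0:
        raise ValueError("expected_length must be positive")
    total = len(reconstructed)
    # Count only unambiguous bases; N and other symbols are treated as gaps.
    ungapped = sum(1 for base in reconstructed.upper() if base in "ACGT")
    return total / expected_length, ungapped / expected_length
```

Reporting both ratios per tool and per virus makes the outputs comparable across genomes of different sizes, which raw base counts do not.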
  
  1) Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? Yes
  
  2) Are the conclusions adequately supported by the data shown? Yes
  
  3) Please indicate the quality of language in the manuscript. Does it require a heavy editing for language and clarity? The language is well understandable
  
  4) Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used? Yes
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.01.17.633368v1
www.biorxiv.org www.biorxiv.org

Haplotype-resolved chromosome-level genome assemblies of four Diamesa species reveal the genetic basis of cold tolerance and high-altitude adaptations in arctic chironomids

2
1. GigaScience 05 Jan 2026
  
  in GigaScience
  
  ABSTRACT: Arctic and alpine insects face extreme environmental stressors, yet the genomic basis of their adaptation remains poorly understood. Here, we present the first haplotype-resolved, chromosome-level genomes for four species of Diamesa (Diptera: Chironomidae), a genus of cold-adapted midges inhabiting glacial and high-altitude freshwater ecosystems. Using PacBio HiFi sequencing and Hi-C scaffolding, we assembled high-quality genomes with chromosome-level resolution and high k-mer completeness. Phylogenomic analyses support Diamesinae as sister to other Chironomidae except Podonominae, and genomic comparisons provide evidence for introgression between the evolutionary distinct D. hyperborea and D. tonsa. Comparative genomic analyses across 20 Diptera species revealed significant gene family contractions in Diamesa associated with oxygen transport and metabolism, suggesting adaptations to high-altitude, low-oxygen environments. Conversely, expansions were detected in histone-related and Toll-like receptor gene families, likely enhancing chromatin remodeling and immune regulation under cold stress. A single gene family encoding glucose dehydrogenase was significantly expanded across all cold-adapted species studied, implicating its role in cryoprotectant synthesis and oxidative stress mitigation. Notably, Diamesa species exhibit the largest gene family contraction at any node, with minimal overlap in expansions with other cold-adapted Diptera, indicating lineage-specific adaptation. Our findings support the hypothesis that genome size condensation and selective gene family changes underpin survival in cold environments. These genome assemblies represent a valuable resource for investigating adaptation, speciation, and conservation in cold-specialist insects. Future work integrating gene expression and population genomics will further illuminate the evolutionary resilience of Diamesa in a warming world.

  Competing Interest Statement: The authors have declared no competing interest.

  Footnotes: # Indicates shared senior authorship

  Funder Information Declared: The Research Council of Norway, https://ror.org/00epmv149, 326819, 270068
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf160), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Chao Bian
  
  This paper, entitled 'Haplotype-resolved chromosome-level genome assemblies of four Diamesa species reveal the genetic basis of cold tolerance and high-altitude adaptations in arctic chironomids', presents four chromosome-level genomes of Diamesa generated using PacBio HiFi. Phylogenetic relationships and gene families were analysed in this study.
  
  However, I strongly suggest that the authors show the expansion results in figures. A textual description of the expansions alone is too weak.
  
  The evidence for genome size condensation is also limited and insufficient.
  
  Some suggestions: The format of the abstract should be largely revised.
  
  Lines 35-37: this sentence is too dense to follow; split it for clarity.
  
  Lines 123-126: the temperature cycling profile can be removed. I do not think this level of detail is needed.
  
  Line 147, "Prior" should be 'prior'.
  
  Please add the RRID and version for each software tool used.
  
  Line 191, 'samtools' should be 'SAMtools'.
  
  The detailed parameters also need to be shown.
  
  For Table 2, the layout is somewhat disordered. Please move the HiFi read coverage, Hi-C read coverage, consensus quality, k-mer (both), and heterozygosity rows to the bottom half of the table.
  
  Lines 327 and 337: align subheadings to the left margin. Do not indent.
  
  Line 338: 'between' should be revised to 'among'.
  
  I suggest that the authors add a figure showing the expansion of glucose dehydrogenase.
  
  Line 359, 1066 should be '1,066'.
  
  For the data records, where were the sequences deposited? No NCBI project ID is given in this study.
2. GigaScience 05 Jan 2026
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf160), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: shijun xiao
  
  Using the first haplotype-resolved, chromosome-level genomes for four species of Diamesa, the authors provide a valuable resource for investigating adaptation, speciation, and conservation in cold-specialist insects, and they analyzed the genomic basis of cold-environment adaptation, including significant gene family contractions and expansions. The manuscript effectively highlights the ecological importance of Diamesa midges and the novelty of generating haplotype-resolved, chromosome-level genomes, providing a strong rationale for the study. I think the manuscript could be accepted after the authors address the following minor issues:
  
  1. The QV values in Table 2 were evaluated using Hi-C data. Could you clarify the rationale for this approach? In general, Hi-C data are not suitable for assessing genome quality; instead, whole-genome short reads are more appropriate for such evaluations. The relatively low QV value of 20 might be due to the use of Hi-C data, as high-quality short-read evaluations typically yield QV values around 30. If short-read data are available, I recommend re-evaluating the genome quality with Merqury using those reads. If short reads are not available, please provide a reasonable justification for the use of Hi-C data, or retain only the QV evaluation based on HiFi read alignments.
  
  2. The title of the manuscript mentions that the genome assemblies are at the chromosome level, and the Conclusions section also refers to chromosome numbers. It would be helpful to include the number of chromosomes in Table 2, which would provide a more intuitive representation of chromosome features and highlight differences among the species.
  
  3. Based on Supplementary Figure 2 and Table 2, it can be observed that the haplotype carrying the fourth scaffold has a slightly larger genome size and more protein-coding genes than the other haplotype, although the difference is not very pronounced. Could the authors clarify whether this is due to a biological feature of Diamesinae species or a consequence of the assembly process?
  
  4. In addition, BUSCO results are only reported as overall completeness, without distinguishing between single-copy and duplicated genes. It would be helpful to provide this information, as it would give a more complete picture of genome quality and potential assembly artifacts.
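  The QV figures discussed in point 1 of this review are easier to compare as error rates. The sketch below assumes the usual Phred scaling, QV = -10 * log10(per-base error probability); the function names are illustrative and are not taken from Merqury or from the manuscript.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled consensus quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability back to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


# Compare the two QV levels mentioned in the review.
for qv in (20, 30):
    err = qv_to_error_rate(qv)
    print(f"QV {qv}: about 1 error per {round(1 / err):,} bases")
```

  On this scale, each 10-point increase in QV corresponds to a tenfold lower estimated error rate, so the read set used for the evaluation can change the reported quality substantially.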
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.07.30.667619v1
www.biorxiv.org www.biorxiv.org

Open RGB Imaging Workflow for Morphological and Morphometric Analysis of Fruits using AI: A Case Study on Almonds

3
1. GigaScience 05 Jan 2026
  
  in GigaScience
  
  Abstract
  
  High-throughput phenotyping is addressing the current bottleneck in phenotyping within breeding programs. Imaging tools are becoming the primary resource for improving the efficiency of phenotyping processes and providing large datasets for genomic selection approaches. The advent of AI brings new advantages by enhancing phenotyping methods using imaging, making them more accessible to breeding programs. In this context, we have developed an open Python workflow for analyzing morphology and heritable morphometric traits using AI, which can be applied to fruits and other plant organs. This workflow has been implemented in almond (Prunus dulcis), a species where efficiency is critical due to its long breeding cycle. Over 25,000 kernels, more than 20,000 nuts, and over 600 individuals have been phenotyped, making this the largest morphological study conducted in almond. As a result, new heritable morphometric traits of interest have been identified. These findings pave the way for more efficient breeding strategies, ultimately facilitating the development of improved cultivars with desirable traits.
  
  Competing Interest Statement: The authors have declared no competing interest.
  
  Footnotes: https://github.com/jorgemasgomez/almondcv2
  
  Abbreviations: GPU, Graphics Processing Unit; YOLO, You Only Look Once; SAM, Segment Anything Model; ROI, Region of Interest
  
  Funder Information Declared: Ministerio de Ciencia y Universidades, España, PID2021-127421OB-I00, FPU20/00614; Fundación Séneca
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf157), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3: Yu Jiang
  
  The present study entitled "Open RGB Imaging Workflow for Morphological and Morphometric Analysis of Fruits using AI: A Case Study on Almonds" reported the development of a Python-based image analysis pipeline that extracts morphological traits of almond nut shells and kernels. A case study was conducted to use the developed pipeline to analyze breeding populations of 665 genotypes and extract both general morphology traits such as height, length, area, aspect ratio, etc. and specialized traits for almond such as width at three heights, vertical and horizontal symmetry, etc. Further, each nut shell or kernel was weighed, so models were established to use the weight and morphological traits to predict the thickness of each nut shell or kernel. In addition to morphological traits, morphometric (shape) traits were extracted for each nut shell or kernel. Clustering analysis was performed on the morphometric traits to identify variability among genotypes. To further validate the efficacy of the extracted traits, broad-sense heritability was calculated and used as a criterion.
  
  The major contribution of this study is the integration of different components (e.g., camera calibration, image segmentation, and morphological/morphometric trait extraction, etc.) as a user-accessible, open-source Python implementation for the plant breeding community, especially for almond breeders. However, there are several aspects that could be further improved.
  
  First, the present study phenotyped the largest number of samples among recent efforts on almond nut shell and/or kernel phenotyping. However, there was no clear evidence to demonstrate direct benefits to ongoing almond breeding. Certain traits (e.g., aspect ratio, tip/top/side curvatures) could be included in a breeding program, but what is the significance of including these traits in breeding programs? Are they crucial for improving the productivity, quality, or other management or processing practices of the almond industry, especially in a breeding context?
  
  Second, the pipeline uses deep learning-based segmentation, which is powerful for handling complex backgrounds. Based on the limited figures or example images in the GitHub repo, the background is mostly single colored (e.g., white or black) without appearances that may confuse even conventional segmentation, especially if image color is calibrated. Assuming most of the almond nut shell and kernel analyses would be done under laboratory conditions, it is not convincing why conventional segmentation methods may not be preferred if both illumination and camera configuration can be well controlled. Ultimately, the question is whether it is worth the effort of labeling hundreds of images to fine-tune a deep learning segmentation model compared to a careful hardware-software design that makes operation more efficient. Alternatively, with the simplified background, vision foundation models such as SAM may be sufficient.
  
  Third, in the Introduction section, some technical statements should be revised to make them accurate. For example, image segmentation is a core computer vision task rather than relying on computer vision algorithms. One-stage and two-stage strategies are used to differentiate models for object detection, not image segmentation. Further, Faster RCNN is an object detection model and cannot do image segmentation. It is highly recommended that the authors find a computer science or engineering colleague to proofread the technical statements to ensure accuracy.
  
  Last, the authors' effort to make open-source software available to the community is appreciated. However, the dataset can be equally important to advance scientific discovery and technology development. Is there any plan to make the dataset publicly available to help facilitate the development of additional computer vision algorithms for almond phenotyping?
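  The reviewer's second point, that conventional segmentation may be enough on a uniform background, can be illustrated with Otsu's thresholding method. This is a minimal sketch on a hypothetical 4x4 grayscale image (a bright object on a dark background), written in plain Python; it is not the manuscript's pipeline, and the pixel values are invented for illustration.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form one class and pixels > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the lower class
    sum0 = 0    # sum of gray levels in the lower class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break             # upper class empty, nothing left to split
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


# Hypothetical image: dark background (10-16) with a bright 2x2 object (205-215).
image = [
    [12, 15, 11, 14],
    [13, 210, 205, 12],
    [10, 208, 215, 16],
    [14, 13, 12, 11],
]
flat = [p for row in image for p in row]
t = otsu_threshold(flat)
mask = [[1 if p > t else 0 for p in row] for row in image]
for row in mask:
    print(row)
```

  With controlled illumination the histogram is strongly bimodal and a global threshold separates object from background; cluttered or shadowed scenes are where learned segmentation models tend to justify the labeling effort.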
2. GigaScience 05 Jan 2026
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf157), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Qi Wang
  
  We would like to thank you for submitting your manuscript to our journal. The manuscript proposes an AI-powered open RGB imaging workflow for morphological and morphometric analysis of fruits or other plant organs, with almonds as a case study. The workflow, developed in Python, covers the full pipeline from image pre-processing, segmentation model development and deployment, to trait measurement and analysis. It aims to improve the efficiency and accuracy of phenotyping in breeding programs by addressing the limitations of traditional methods, such as their time-consuming and labor-intensive nature. However, the following problems in this paper still need further improvement:
  
  1. Table formatting: Some of the tables in the manuscript do not follow the formatting standards of the journal. The authors are encouraged to revise them accordingly to ensure clarity, consistency, and ease of understanding.
  
  2. Formula presentation: Certain mathematical formulas are not clearly formatted and appear disorganized. The authors should re-typeset the equations to improve readability and provide clearer explanations for each formula.
  
  3. Introduction: The introduction could be strengthened by more thoroughly explaining the relationship between phenotypic data and breeding. The authors may also discuss how phenotyping data supports genomic selection and accelerates breeding via high-throughput workflows.
  
  4. Methods section: While the paper clearly explains how morphological traits and kernel thickness are measured, it does not sufficiently explain how this data contributes to breeding decisions. The authors should elaborate on how the extracted traits are applied in practical breeding or selection strategies.
  
  5. Lack of algorithmic novelty: While the integration of existing tools is commendable, the core methods used (e.g., YOLO, SAHI) are based on publicly available models, without introducing new algorithmic components or comparative ablation studies. The authors are advised to clarify the unique contribution of their workflow, especially in terms of engineering integration or practical usability.
  
  6. Limited evaluation metrics: The performance of segmentation models is only reported using error percentage. The inclusion of standard metrics such as IoU, Precision, Recall, and F1-score would allow for a more comprehensive evaluation and comparison across models (see LLRL methods).
  
  7. Figures and captions: Currently, figure images and their descriptions are placed separately, which may reduce readability. It is recommended to place figure captions immediately beneath or alongside the figures to enhance the paper's coherence and user-friendliness.
  
  8. Trait extension suggestions: To enhance the expressiveness and resolution of phenotypic trait modeling, the authors are advised to refer to recent research on extracting fine-grained phenotypic features from plant images. For example, PlanText proposed a progressive visual guidance strategy to help improve the modeling quality of phenotypic traits in images.
  
  Therefore, I would like to give a "Major Revision" recommendation.
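  The metrics requested in point 6 can all be computed from pixel-level counts of true positives, false positives, and false negatives. The following is a minimal sketch for a single binary mask pair; the two toy masks and the function name are hypothetical and do not come from the manuscript or its repository.

```python
def segmentation_metrics(pred, truth):
    """Compute pixel-wise IoU, precision, recall and F1 for two binary masks.

    Both masks are lists of rows holding 0 (background) or 1 (object).
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # predicted object, truly object
            elif p and not t:
                fp += 1      # predicted object, truly background
            elif t and not p:
                fn += 1      # missed object pixel

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground-truth and predicted masks for one small image.
truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
metrics = segmentation_metrics(pred, truth)
print(metrics)
```

  In practice these values would be averaged over a held-out image set, and for instance segmentation they are usually reported per object after matching predictions to ground truth.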
3. GigaScience 05 Jan 2026
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf157), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Yuvraj Chopra
  
  The methods described in this article represent a useful tool for fast and reliable morphometric analysis of almonds, with potential applications in other fruits. The pipeline is technically sound, and the publicly available workflow will advance the adoption of this technology. However, there are critical concerns that need to be addressed before the manuscript can proceed to publication in the journal.
  
  Major Comments
  
  1. The authors claim this technique as a new phenotyping tool with breakthrough implications; however, I object to this claim. Numerous studies have utilized this technique in plant phenotyping, to the extent that labeling it as a new phenotyping tool may not be ideal. Additionally, for kernel or seed morphometrics, a wide array of user-friendly, open-source tools have already been developed and are readily available, for example:
  - SeedExtractor https://www.frontiersin.org/journals/plant-science/articles/10.3389/fpls.2020.581546/full#h3
  - SmartGrain https://academic.oup.com/plphys/article/160/4/1871/6109568
  - GrainScan https://link.springer.com/article/10.1186/1746-4811-10-23
  - PlantCV https://peerj.com/articles/4088/
  These tools and many other options can readily be used for almond kernel morphometrics. The authors are requested to discuss and compare the advantages and performance of their model with SeedExtractor, SmartGrain, and GrainScan.
  
  2. Within horticultural crops, workflows and studies (not acknowledged in this article) are available that can be adapted or modified to do the same thing. For example, publications from as early as 2020 used machine learning models to measure the size and mass of almonds; however, this relevant study was not acknowledged by the authors: https://onlinelibrary.wiley.com/doi/full/10.1111/jfpe.13374 . Discuss how the presented method is better than the aforementioned article, and justify the claim 'breakthrough'.
  
  3. The authors claim to have successfully tested the pipeline on apples and strawberries. Information on fruit size can be extracted from 2D images; however, the example results show only length, width, circularity, and ellipse ratio. How do these parameters assist fruit breeders? Since it is segmentation-based classification using a reference scale, the aforementioned tools, particularly SeedExtractor, can generate similar results. Does this qualify the tool for integration into fruit crop breeding pipelines? Moreover, fruit breeders require on-tree analysis, and recent advancements have enabled 3D sensing for significantly better detection, particularly using cost-effective RGB-D cameras.
  
  Minor Comments
  
  1. Change the title. This study uses deep learning models, which are a type of AI; AI is a broader term. Additionally, the potential for utility in fruit breeding pipelines appears to be limited. Suggested: Open RGB imaging workflow for morphological and morphometric analysis of almond kernels using deep learning.
  
  2. The video tutorial shows YOLOv8 being used; mention this in the Methods. Add information on all settings used in CVAT.
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.05.05.652179v1
www.biorxiv.org www.biorxiv.org

A sulfatide-centered ultra-high resolution magnetic resonance MALDI imaging benchmark dataset for MS1-based lipid annotation tools

2
1. GigaScience 05 Jan 2026
  
  in GigaScience
  
  ABSTRACT
  
  Spatial ‘omics techniques are indispensable for studying complex biological systems and for the discovery of spatial biomarkers. While several current matrix-assisted laser desorption/ionization (MALDI) mass spectrometry imaging (MSI) instruments are capable of localizing numerous metabolites at high spatial and spectral resolution, the majority of MSI data is acquired at the MS1 level only. Assigning molecular identities based on MS1 data presents significant analytical and computational challenges, as the inherent limitations of MS1 data preclude confident annotations beyond the sum formula level. To enable future advancements of computational lipid annotation tools, well-characterized benchmark - or ground truth - datasets are crucial, which exceed the scope of synthetic data or data derived from mimetic tissue models. To this end, we provide two sulfatide-centered, biology-driven magnetic resonance MSI (MR-MSI) datasets at different mass resolving powers that characterize lipids in a mouse model of human metachromatic dystrophy. This data includes an ultra-high-resolution (R ∼1,230,000) quantum cascade laser mid-infrared imaging-guided MR-MSI dataset that enables isotopic fine structure analysis and therefore enhances the level of confidence substantially. To highlight the usefulness of the data, we compared 118 manual sulfatide annotations with the number of decoy database-controlled sulfatide annotations performed in Metaspace (67 at FDR < 10%). Overall, our datasets can be used to benchmark annotation algorithms, validate spatial biomarker discovery pipelines, and serve as a reference for future studies that explore sulfatide metabolism and its spatial regulation.
  
  Competing Interest Statement: Bruker Daltonics co-funded the BMBF-funded projects Drugs4Future and DrugsData within the framework M2Aind, as mandated by BMBF, but did not influence this study. All other authors declare no competing interests.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf150), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Hikmet Budak
  
  I believe that the dataset produced is a great contribution to the community. My major concerns are as follows: 1. The data described are good, but please clarify how the discrepancy between the manual annotations and the computational annotations would be resolved, and discuss annotation quality and the associated challenges for the sulfatide-centered MSI dataset. 2. Please remove references that are too old unless they are pioneering, and replace them with newer ones. 3. Please try to present some of the content as supplementary figures instead of text. 4. Is the algorithm fully optimized or not? 5. How did you recover the missing annotations? Please clarify/elaborate on this.
  
  Would be happy to review after revisions.
2. GigaScience 05 Jan 2026
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf150), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Morteza Akbari
  
  This manuscript by Gruber et al. provides a Data Note detailing a high-value, sulfatide-focused benchmark dataset for the mass spectrometry imaging (MSI) community. The project is well thought out, technically advanced, and directly meets a major need for biologically relevant, deeply characterized ground-truth data to test MS1-level metabolite annotation software. It is a big technical achievement to create an ultra-high-resolution dataset (R∼1,230,000) with a 7T FT-ICR instrument. The use of isotopic fine structure (IFS) to boost annotation confidence is a major strength. Using QCL-MIR imaging strategically to guide the MSI acquisition is a smart and effective way to do things. It's great that the authors are committed to FAIR principles.
  
  The writing in the manuscript is excellent, and the data is very good. It makes a big difference in the field. There are, however, several changes that should be made to make it clearer, more scientifically complete, and more useful as a stand-alone benchmark resource for the community. The following points are given to help make the manuscript stronger for publication.
  
  Major Revisions
  
  Provision of the "Ground Truth" Annotation List: The benchmark dataset is the most important part of this Data Note. The manuscript's supplementary information, on the other hand, doesn't seem to have the final, curated list of manual annotations that make up the "ground truth." For this dataset to be truly reusable for benchmarking third-party software, it needs another table. This table should show all of the manually annotated sulfatides for each replicate, along with their experimental m/z, proposed sum formula, lipid annotation, mass error (ppm), and a way to tell if IFS was used to confirm them.
  
  Strengthening the "Ground Truth" Justification: The manuscript depends on an earlier publication (Ref) to validate the sulfatide structures using MS/MS. It is acceptable to reference previous work, but a benchmark Data Note should be as self-sufficient as possible. Please add a short paragraph to the "Data Validation and Quality Control" section that sums up the main MS/MS fragmentation evidence from Ref that backs up the sulfatide identifications. This will give users of the dataset a more complete and clear chain of evidence.
  
  Deeper Analysis of Automated Annotation Discrepancies: The comparison with Metaspace shows how important this dataset is by showing that even a top-of-the-line tool can't annotate 14 high-confidence sulfatides. The discussion needs to be longer so that it can look at why these failures could be happening. Please explain why Metaspace's scoring algorithm, which only looks at the four most intense isotopic peaks, might not work well with this kind of ultra-high-resolution data where low-intensity IFS peaks (like ³⁴S) are very important. Talking about how future algorithms could make better use of this information would make the paper much more useful and help with the development of new tools.
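  Two quantities raised in these major revisions, mass error in ppm and the resolving power needed to see the ³⁴S isotopic fine structure, follow from simple arithmetic. The sketch below assumes standard isotope masses (rounded), a singly charged ion, and a hypothetical m/z of 800; it uses the simple m/Δm estimate and is not derived from the dataset itself.

```python
# Isotope masses in daltons (standard values, rounded to 8 decimals).
MASS_32S = 31.97207117
MASS_34S = 33.96786701
MASS_12C = 12.0
MASS_13C = 13.00335484


def ppm_error(measured_mz: float, theoretical_mz: float) -> float:
    """Signed mass error of a measured m/z relative to the theoretical m/z, in ppm."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def required_resolving_power(mz: float, delta_mz: float) -> float:
    """Nominal resolving power m/dm needed to separate two peaks dm apart at m/z."""
    return mz / abs(delta_mz)


# Both isotopologues appear at the nominal M+2 position but differ slightly in mass.
shift_34s = MASS_34S - MASS_32S             # one 32S replaced by 34S
shift_2x13c = 2 * (MASS_13C - MASS_12C)     # two 12C replaced by 13C
delta = shift_2x13c - shift_34s             # spacing of the fine-structure doublet

example_mz = 800.0  # hypothetical sulfatide m/z, assumed singly charged
print(f"34S vs 2x13C spacing: {delta:.5f} Da")
print(f"Nominal resolving power needed at m/z {example_mz:.0f}: "
      f"{required_resolving_power(example_mz, delta):,.0f}")
print(f"Example mass error: {ppm_error(800.0008, 800.0):.2f} ppm")
```

  The nominal figure is a lower bound: resolving a weak ³⁴S peak next to a stronger ¹³C₂ peak with clean baseline separation typically needs considerably more resolving power, which is consistent with the reviewer's emphasis on low-intensity IFS peaks.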
  
  Minor Revisions
  
  Clarification of Table 1: The row headers for the R2 dataset ("all" vs. "QCL-MIR-guided") are slightly confusing, as all R2 data is QCL-MIR-guided. Please revise these for clarity (e.g., "Total Annotations in ROIs" and "Annotations with Confirmed IFS Evidence").
  
  Definition of "Internal Error": The legend for Figure 1g should include a brief definition or reference for how "internal error" was calculated to ensure the metric is fully understood.
  
  Confirmation of Database Contents: In the Methods section, please add a sentence explicitly confirming that all manually annotated sulfatide species were included in the custom database file used for the Metaspace analysis. This is a crucial detail for a fair comparison.
  
  Explicit Statement of Dataset Limitations: In the "Re-use Potential" section, it would be beneficial to explicitly state the inherent trade-off of the ultra-high-resolution approach. Please add a sentence acknowledging that the dataset is optimized for high-confidence annotation and that this comes at the cost of reduced sensitivity and comprehensive spatial coverage compared to a standard MSI experiment.
  
  Link to Custom Database: The Methods section mentions the creation of a custom database of 780 theoretical sulfatides. Please explicitly state in the text that this database is available as Supplementary Dataset 3.
  
  Addressing these points will significantly enhance the manuscript's value and ensure its lasting impact as a key resource for the computational mass spectrometry community.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.10.20.683394v1
www.biorxiv.org www.biorxiv.org

An Integrative Multi-Omics Random Forest Framework for Robust Biomarker Discovery

3
1. GigaScience 05 Jan 2026
  
  in GigaScience
  
  ABSTRACT: High-throughput technologies now produce a wide array of omics data, from genomic and transcriptomic profiles to epigenomic and proteomic measurements. Integrating these diverse data types can yield deeper insights into the biological mechanisms driving complex traits and diseases. Yet, extracting key shared biomarkers from multiple data layers remains a major challenge. We present a multivariate random forest (MRF)–based framework enhanced by a novel inverse minimal depth (IMD) metric for integrative variable selection. By assigning response variables to tree nodes and employing IMD to rank predictors, our approach efficiently identifies essential features across different omics types, even when confronted with high-dimensionality and noise. Through extensive simulations and analyses of multi-omics datasets from The Cancer Genome Atlas, we demonstrate that our method outperforms established integrative techniques in uncovering biologically meaningful biomarkers and pathways. Our findings show that selected biomarkers not only correlate with known regulatory and signaling networks but can also stratify patient subgroups with distinct clinical outcomes. The method’s scalable, interpretable, and user-friendly implementation ensures broad applicability to a range of research questions. This MRF-based framework advances robust biomarker discovery and integrative multi-omics analyses, accelerating the translation of complex molecular data into tangible biological and clinical insights.
  
  Competing Interest Statement: The authors have declared no competing interest.
  
  Footnotes: Author Name Correction and Documentation Update.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf148), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Moran Chen
  
  This manuscript presents a novel multivariate random forest (MRF) framework enhanced by the inverse minimal depth (IMD) metric for integrative multi-omics biomarker discovery. The authors clearly demonstrate the robustness and superiority of the proposed methods through comprehensive simulation studies and validation on TCGA datasets. The manuscript provides clear methodological explanations, offering valuable insights into its practical utility. I recommend accepting the manuscript after minor revisions.
  
  Minor concerns:
  
  1. Biological Interpretation Depth: While the authors identified biologically relevant biomarkers, the biological interpretations remain somewhat superficial. A deeper exploration of novel or less-known biomarkers in the context of disease mechanisms would strengthen the biological relevance of the findings.
  
  2. Sensitivity Analysis of Randomness: The authors should conduct and discuss sensitivity analyses regarding different random states or random seeds to assess the stability of the method's results.
  
  3. Comparison with Existing Methods on Real Data: While the simulation studies provide thorough benchmarking, the manuscript could enhance its practical value by including detailed comparisons with methods such as SPLS, PMDCCA, and SGCCA using the real-world TCGA datasets.
  
  4. Applicability to Other Diseases: The authors primarily focus on cancer datasets. It is recommended to discuss potential applicability to other disease contexts, such as neurodegenerative or immunological diseases, to illustrate broader utility.
  
  5. Improved Visualization: Some figures in the manuscript have font sizes that are too small, which might impair readability. It is recommended to enlarge the text labels, legends, and axis annotations to ensure that all information is clearly visible and accessible. In Figure 8, the use of sub-labels (such as a, b, c) is mentioned in the text, but these labels are not visible in the figure itself.
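The seed sensitivity analysis requested in concern 2 could be summarised in several ways. The following is a minimal sketch of one of them, assuming the top-ranked feature set is recorded for each random seed and stability is reported as the mean pairwise Jaccard overlap. The function names and the toy feature sets are hypothetical and are not taken from the manuscript or its software.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two feature collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(selections):
    """Average Jaccard overlap over all pairs of per-seed selections."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need selections from at least two seeds")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical top-3 features selected under three different seeds.
selections_by_seed = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4"},
    {"g1", "g2", "g3"},
]
stability = mean_pairwise_jaccard(selections_by_seed)
```

A value close to 1 would indicate that the selected features barely depend on the seed; lower values would suggest that the ranking is sensitive to randomness.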
2. GigaScience 05 Jan 2026
  
  in GigaScience
  
  Reviewer 2: Yun-Juan Bao
  
  The article presents an Integrative Multi-Omics Random Forest Framework for Robust Biomarker Discovery. It addresses the challenge of extracting key shared biomarkers from multiple omics data types by introducing a multivariate random forest-based approach enhanced by an inverse minimal depth metric.
  
  I have some concerns and comments below:
  
  1. The new algorithm described in the study selects omics variables by assigning response variables to decision tree nodes. How do the response variables relate to biological responses/outcomes? From the authors' description, it seems that the omics variables selected using the IMD are almighty, i.e., they can predict anything needed, such as prognosis, cancer types, etc. Actually, the usual logic for selecting omics variables to predict prognosis is to evaluate the association between omics variables and survival time.
  
  2. Following the discussion in 1, what is the biological meaning of extracting shared biomarkers from multiple data layers? While it is straightforward to think that the shared biomarkers between multiple data layers or data types may induce the same biological responses, the unique biomarkers also matter, depending on which biological responses we care about.
  
  3. The Introduction section is not sufficient. The biological significance and technical details of "extract shared biomarkers from multiple data layers" need to be explained in more detail.
  
  4. It is advised to provide some examples of the statement in the Introduction: "may fail to capture nonlinear interactions" of the current methods (sPLS, CCA).
  
  5. It is also advised to explain and illustrate how the new method proposed in this study addresses the challenge that traditional methods face in capturing nonlinear relationships. An ablation study could be one of the choices.
  
  6. The authors showed that their new approach "uncovered known cancer biological relevant pathways". How about the functional enrichment of genes selected from traditional methods, such as sPLS, CCA?
  
  7. The authors showed that the selected RNA-seq and ATAC-seq features using the new approach are able to capture the distinction between different cancer types (Figure 8). It is suggested to quantitatively evaluate this capability using metrics such as recall and precision, to calculate how many samples are correctly classified and how many are misclassified in comparison with other methods.
  
  8. It is advised to refine the Discussion. In what scenarios can the new method be applied? What biological insights can be obtained and what can be missed by the new method?
  
  9. The authors did not provide sufficient details about the datasets they used in the Methods section. How many samples in TCGA? How many features did they use? How many features were left after filtering?
  
  10. Although the new approach showed some superiority in comparison with other methods, the authors only used the currently known databases. It is advised to apply their approach to additional testing datasets or real-world datasets to increase the confidence of the conclusion of this study. It is also observed that the performance of sPLS is better than others in some cases (Figure 4).
  
  11. It is suggested to refine the figures. The labels and legends are too tiny to be seen.
  
  12. There are no sub-figure labels a, b, c, d, e, f in Figure 8. The positions of sub-figure labels in Figure 3, Figure 4, Figure 5, Figure 7 are not correct.
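The quantitative evaluation suggested in comment 7 could look roughly like the sketch below. It assumes that each sample has a known cancer type and a predicted one (for example from a classifier or from cluster assignments mapped to cancer types, which is an assumption and not something the manuscript specifies). The function name and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_precision_recall(true_labels, predicted_labels):
    """Return {label: (precision, recall)} for every label that occurs."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1          # correctly classified sample
        else:
            fp[pred] += 1           # counted against the predicted class
            fn[truth] += 1          # counted against the true class
    scores = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else float("nan")
        recall = tp[label] / actual if actual else float("nan")
        scores[label] = (precision, recall)
    return scores


# Hypothetical toy labels, not data from the manuscript.
truth = ["BRCA", "BRCA", "LUAD", "LUAD", "KIRC", "KIRC"]
pred = ["BRCA", "LUAD", "LUAD", "LUAD", "KIRC", "BRCA"]
scores = per_class_precision_recall(truth, pred)
```

Reporting the same per-class table for each competing method would make the comparison in Figure 8 quantitative rather than visual.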
3. GigaScience 05 Jan 2026
  
  in GigaScience
  
  Reviewer 3: Yingxia Li
  
  Summary: This manuscript presents a novel multivariate random forest (MRF)-based framework, incorporating the Inverse Minimal Depth (IMD) metric, for integrative multi-omics variable selection and robust biomarker discovery. The method is thoughtfully developed, rigorously evaluated through comprehensive simulations, and effectively demonstrated on TCGA datasets. The topic is highly relevant, and the manuscript is generally well-organized and clearly written.
  
  Major comments: The proposed MRF-IMD framework demonstrates significant advantages in handling nonlinear relationships and high-dimensional data integration. However, a more comprehensive comparison with other nonlinear ensemble methods (e.g., gradient boosting or deep learning approaches) is recommended to highlight its uniqueness.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.03.05.641533v2
www.biorxiv.org www.biorxiv.org

The Good, the Bad, and the Ugly: Segmentation-Based Quality Control of Structural Magnetic Resonance Images

4
1. GigaScience 05 Jan 2026
  
  in GigaScience
  
  Abstract: The processing and analysis of magnetic resonance images is highly dependent on the quality of the input data, and systematic differences in quality can consequently lead to loss of sensitivity or biased results. However, varying image properties due to different scanners and acquisition protocols, as well as subject-specific image interferences, such as motion artifacts, can be incorporated in the analysis. A reliable assessment of image quality is therefore essential to identify critical outliers that may bias results. Here we present a quality assessment for structural (T1-weighted) images using tissue classification. We introduce multiple useful image quality measures, standardize them into quality scales and combine them into an integrated structural image quality rating to facilitate the interpretation and fast identification of outliers with (motion) artifacts. The reliability and robustness of the measures are evaluated using synthetic and real datasets. Our study results demonstrate that the proposed measures are robust to simulated segmentation problems and variables of interest such as cortical atrophy, age, sex, brain size and severe disease-related changes, and might facilitate the separation of motion artifacts based on within-protocol deviations. The quality control framework presents a simple but powerful tool for the use in research and clinical settings.
  
  Competing Interest Statement: The authors have declared no competing interest.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf146), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 4: Laura Caquelin
  
  Reproducibility report for: The Good, the Bad, and the Ugly: Segmentation-Based Quality Control of Structural Magnetic Resonance Images
  
  Journal: GigaScience
  
  ID number/DOI: GIGA-D-25-00085
  
  Reviewer(s):
  
  Laura Caquelin, Department of Clinical Neuroscience, Karolinska Institutet, Sweden [Worked on reproducing the results and wrote the report]
  
  Tobias Wängberg, Department of Clinical Neuroscience, Karolinska Institutet, Sweden [Worked on reproducing the results]
  
  Summary of the Study
  
  The study addresses how variability in magnetic resonance image quality, especially from motion artifacts or scanner differences, can affect structural image analysis. It proposes a quality assessment framework for T1-weighted images based on tissue classification and standardized image quality measures. The method is shown to be robust across datasets and conditions, helping to detect outliers and control for motion-related artifacts.
  
  Scope of reproducibility
  
  According to our assessment the primary objective is: to develop and validate a standardized framework for assessing the quality of structural (T1-weighted) MRI images, enabling the detection of artifacts on simulated data.
  
  Outcome: Quantitative quality ratings derived from image properties such as noise-to-contrast ratio (NCR), inhomogeneity-to-contrast ratio (ICR), resolution score (RES), edge-to-contrast ratio (ECR), and full-brain Euler characteristic (FEC), combined into a Structural Image Quality Rating (SIQR).
  
  Analysis method outcome: Not specified in the manuscript, but from the Matlab script we identified that the quality scores were correlated using Spearman's rank correlation, and that statistical significance was assessed using p-values computed with MATLAB's built-in method.
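For readers unfamiliar with the statistic, the following is a minimal sketch of Spearman's rank correlation: both variables are converted to ranks (ties receive the average rank) and the Pearson correlation of the ranks is returned. It is written in plain Python for illustration only and is not the authors' MATLAB code; the function names and toy values are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the average ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Hypothetical toy values: a monotone but non-linear relationship.
perturbation = [1, 2, 3, 4, 5]
rating = [2, 4, 6, 8, 100]
rho = spearman_rho(perturbation, rating)
```

Because only the ordering matters, the outlying value 100 does not reduce the correlation here.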
  
  Main result: Results are presented in Figure 5. "The evaluation on the BWP test dataset showed that most quality ratings have a very high correlation (rho > .950, p < .001) with their corresponding perturbation and a very low correlation (rho < |0.1|) with the other tested perturbations (see table in Figure 5A & C). This suggests considerable specificity of the proposed quality measures. The combined SIQR score also showed a very strong association with the segmentation quality kappa (rho = -.913, p < .001) and brain tissue volumes (rhoCSF/GM/WM = -.472/-.484/.736, pCSF/GM/WM < .001) (Figure 5B). […] The edge-based resolution measure ECR, on the other hand, generally performed better (rho = .828, p < .001), but was more affected by noise (rho = .306, p < .001) and inhomogeneity (rho = .223, p < .001) than other scores."
  
  Availability of Materials
  
  a. Data
  
  Data availability: Open
  
  Data completeness: Complete, all data necessary to reproduce main results are available
  
  Access Method: Private journal dropbox but also available on Github repository
  
  Repository: https://github.com/ChristianGaser/cat12
  
  Data quality: Structured
  
  b. Code
  
  Code availability: Shared in the private journal dropbox but also open
  
  Programming Language(s): Matlab
  
  Repository link: https://github.com/ChristianGaser/cat12
  
  License: GPL-2.0 License
  
  Repository status: Public
  
  Documentation: Readme file
  
  Computational environment of reproduction analysis
  
  Operating system for reproduction: MacOS 15.5 (reviewer 1) and MacOS 15.1 (reviewer 2)
  
  Programming Language(s): Matlab
  
  Code implementation approach: Using shared code
  
  Version environment for reproduction: Matlab R2024b Update 6 (24.2.2923080) - Trial version
  
  Results
  
  5.1 Original study results - Results 1: Figure 5 C (see screenshot)
  
  5.2 Steps for reproduction
  
  ->Finding how to reproduce the results - Issue 1: The methods section lacks sufficient detail regarding the statistical methodology, and the relevant information is not fully provided in the GitHub repository. -- Resolved: A message has been sent to the authors requesting further clarification on the methodology and additional resources (scripts/data) needed to reproduce the results. The script to reproduce the results is "cat_tst_qa_bwpmaintest.m".
  
  -> Reproduce the results using the "cat_tst_qa_bwpmaintest.m" script. - Issue 2: To run the script "cat_tst_qa_bwpmaintest.m", the "eva_vol_calcKappa" function is missing. -- Resolved: The script was shared and added to the Github repository. - Issue 3: While running the script, the following error message was encountered:
  
  Assigning to 0 elements using a simple assignment statement is not supported. Consider using comma-separated list assignment.
  
  Error in cat_tst_qa_bwpmaintest (line 481)
  
  default.QS{find(cellfun('isempty',strfind(default.QS(:,2),'FEC'))==0),4} = [100, 850];
  
  -- Resolved: This error stops the execution of the script. After discussion with the authors, the exact cause of the error encountered at line 480 was not directly identified. We exchanged and compared our environments at the point just before the error occurred and observed notable differences between them. Our environment is almost empty. The authors identified that the default variable is missing from our environment, even though it is referenced at line 437 by a call to the cat_stat_marks function. We confirmed that all required dependencies were installed (including Statistics toolbox, SPM and CAT12), and that we had access to all the necessary data. To ensure the issue was not due to user error, the code was independently executed by two reviewers. The error was consistently reproduced in both cases. About the setup, I specified to the authors:
  
  "To summarize my setup:
  
  * I have installed SPM, CAT, and the Statistics Toolbox.
  
  * I downloaded all datasets from the GigaScience server.
  
  * I also downloaded the IXI T1 data, but I've only kept the version available on the GigaScience server in my working directory. Is the version from GigaScience sufficient? I had presumed that this dataset was pre-processed and ready to use, so I ignored the time-consuming pre-processing step. Your last email seems to confirm this point."
  
  The authors answered that: « Yes, this is correct. However, both directories have to be combined so that the original IXI images and the processing files are included. »
  
  In an attempt to proceed, we modified the portion of the code that triggered the error:
  
  % FEC
  FECpos = find(cellfun('isempty',strfind(default.QS(:,2),'FEC'))==0);
  try
    warning off;
    [Q.fit.FEC, Q.fit.FECstat] = robustfit(Q.FECgt(M,1),Q.FECo(M,1));
    warning on;
    if ~isempty(FECpos)
      default.QS{FECpos,4} = round([Q.fit.FEC(1) + Q.fit.FEC(2), Q.fit.FEC(1) + Q.fit.FEC(2) * 6], -1);
    end
  catch
    Q.fit.FEC = [nan nan];
    Q.fit.FECstat = struct('coeffcorr',nan(2,2),'p',nan(2,2));
    if ~isempty(FECpos)
      default.QS{FECpos,4} = [100 850];
    end
  end
  
  Following this adjustment, the end of the script "cat_tst_qa_bwpmaintest.m" ran without issue and generated output results:
  
  Finally, the error was identified after numerous exchanges with the authors. The function "cat_stat_marks", available in the Github repository, was not shared in the FTP server. With this function added, the script runs correctly. Please note that the link to the Github repository where the software code can be found is not specified in the manuscript.
  
  -> Compare the results reproduced and the original results - Issue 4: Discrepancy between the reproduced results, the output results provided by the authors, and the original results shown in Figure 5C. -- Unresolved: We reproduced the figures and the corresponding output table using the modified "cat_tst_qa_bwpmaintest.m" script. We ran the script using only the default QC version selected in the script ("cat_vol_qa201901x"). By comparing our output with the result files shared by the authors, we were able to confirm that we had executed the correct pipeline. However, we encountered a discrepancy: neither the generated file in our run (tst_cat_col_qa201901x_irBWPC_HC_T1_pn9_rf100pC_vx200x200x200rptable.csv) nor the corresponding file provided by the authors (outputs from BWPmain_full_202504) matched the numerical values presented in Figure 5C of the manuscript. We contacted the authors to clarify whether the default QC version used in the script was indeed the one used to produce the figure. In response, they confirmed:
  
  "All figures should show the results of this QC version although I had the plan to run a final check update after the reviewer comments (the figures are finally arranged in Adobe Illustrator)."
  
  Therefore, although the correct version of the QC was used, the differences in the results shown in Figure 5C remain unexplained. This issue is still unresolved.
  
  5.3 Statistical comparison Original vs Reproduced results - Results: Screenshot of reproduced tst_cat_vol_qa201901x_irBWPC_HC_T1_pn9_rf100pC_vx200x200x200_rptable.csv table
  
  Comments: Several p-values in the reproduced results appear as exactly 0 (0.00000000e+00), which is unlikely from a statistical point of view. It is possible that these values are just extremely small and were rounded down. However, this could also point to a problem in the script. Further investigation would be needed to determine the cause.
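The "extremely small and rounded down" explanation can be illustrated with a short sketch. It assumes the large-sample normal approximation for Spearman's rho, in which rho * sqrt(n - 1) is approximately standard normal under the null hypothesis; this is not necessarily the method MATLAB uses, and the sample size below is hypothetical. With a strong correlation and a large sample, the tail probability falls below the smallest positive double and is stored as exactly 0, while its logarithm remains finite.

```python
import math


def two_sided_p_normal_approx(rho, n):
    """Two-sided p-value from the normal approximation z = rho * sqrt(n - 1)."""
    z = abs(rho) * math.sqrt(n - 1)
    return math.erfc(z / math.sqrt(2))


def log10_two_sided_p_normal_approx(rho, n):
    """log10 of the same p-value, still finite when the p-value underflows."""
    x = abs(rho) * math.sqrt(n - 1) / math.sqrt(2)
    p = math.erfc(x)
    if p > 0.0:
        return math.log10(p)
    # Asymptotic tail for large x: erfc(x) ~ exp(-x^2) / (x * sqrt(pi)).
    return (-x * x - math.log(x * math.sqrt(math.pi))) / math.log(10)


# Hypothetical values: a very strong correlation in a large sample.
p_underflow = two_sided_p_normal_approx(0.95, 5000)
log10_p = log10_two_sided_p_normal_approx(0.95, 5000)
# A weak correlation in a small sample gives an ordinary p-value.
p_ordinary = two_sided_p_normal_approx(0.1, 50)
```

Under these assumptions a reported value of 0.00000000e+00 is compatible with floating-point underflow alone, so reporting log p-values or a bound such as "p < 1e-300" would help separate underflow from a genuine script problem.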
  
  Errors detected: Values in Figure 5C do not correspond to those provided by the authors in the FTP server in the files (tst_cat_vol_qa201901x_irBWPC_HC_T1_pn9_rf100pC_vx200x200x200_rptable.csv). Multiple inconsistencies were observed, suggesting potential errors in the manuscript figure or mismatches between file versions (see file Comparison_original_rptable_vs_fig5C_data.csv for comparison).
  
  (Screenshot of Figure 5C)
  
  (Screenshot of the original output corresponding to the Figure 5C)
  
  Statistical Consistency: The reproduced correlation table (tst_cat_vol_qa201901x_irBWPC_HC_T1_pn9_rf100pC_vx200x200x200_rptable.csv) differs from the original in terms of r-values and p-values. Compared with Figure 5C, the reproduced r-values do not all match those shown in the figure. P-values cannot be directly compared to Figure 5C, as they are represented by a color gradient without a scale or legend.
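A cell-by-cell comparison of two correlation tables, as described above, could be automated along the following lines. This is a minimal sketch, assuming both tables are CSV files with a header row and a label in the first column; the real layout of the rptable files is not documented here, and the toy tables and tolerance are hypothetical.

```python
import csv
import io
import math


def read_numeric_table(text):
    """Parse CSV text into {(row_label, column_label): float}."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header = rows[0][1:]
    table = {}
    for row in rows[1:]:
        for column, cell in zip(header, row[1:]):
            table[(row[0], column)] = float(cell)
    return table


def find_mismatches(reference, reproduced, abs_tol=5e-4):
    """List cells that are missing in one table or differ by more than abs_tol."""
    mismatches = []
    for key in sorted(set(reference) | set(reproduced)):
        a, b = reference.get(key), reproduced.get(key)
        if a is None or b is None or not math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol):
            mismatches.append((key, a, b))
    return mismatches


# Hypothetical toy tables, not values from the manuscript or the shared files.
original_csv = "measure,noise,bias\nNCR,0.990,0.020\nICR,0.010,0.970\n"
reproduced_csv = "measure,noise,bias\nNCR,0.990,0.020\nICR,0.050,0.970\n"
mismatches = find_mismatches(
    read_numeric_table(original_csv), read_numeric_table(reproduced_csv)
)
```

The tolerance of 5e-4 corresponds to agreement after rounding to three decimals, which is the precision quoted for the rho values in the manuscript excerpt; a different tolerance may be appropriate for p-values.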
  
  Conclusion
  
  Summary of the computational reproducibility review
  
  The computational reproducibility of the main result we identified for the study is partially achieved. After several technical issues related to missing functions, I was able to execute the script to reproduce values of Figure 5C ("cat_tst_qa_bwpmaintest.m") and obtain output results. However, discrepancies were observed when comparing the reproduced results (tst_cat_col_qa201901x_irBWPC_HC_T1_pn9_rf100pC_vx200x200x200rptable.csv) to both:
  
  the output file provided by the authors, and
  
  the original results presented in Figure 5C of the manuscript.
  
  Notably, the output file provided by the authors and the results in Figure 5C do not match either, indicating potential errors or file version mismatches. Additionally, many p-values in the reproduced results are equal to 0, which suggests a formatting issue or a problem in the script. Figure 5C also lacks a scale, legend detail, or supplementary data that would make it possible to verify the p-values (assuming the color gradient represents the p-values).
  
  Recommendations for authors
  
  We strongly recommend that the authors:
  
  -- Ensure all essential code and functions are included in the shared repositories. Some necessary files were not included in the FTP server provided with the paper. Although the GitHub repository (https://github.com/ChristianGaser/cat12) was shared with the journal, it is not referenced in the manuscript, making it difficult for external users to locate.
  
  -- Add detailed documentation of the statistical methods: the current manuscript lacks sufficient information regarding the statistical methodology used, at least for the purpose of the reproducibility review. Please include a detailed explanation of statistical tests, packages and parameter settings (e.g. QC version) to improve reproducibility.
  
  -- Clarify the versioning and outputs for the figures: there is a lack of clarity regarding which specific data outputs were used to generate Figure 5C. Providing metadata or links to the exact output file used would help to resolve this issue.
  
  -- Provide raw numerical data behind figures: Figure 5C seems to display p-values using a color gradient but no scale or legend is provided. Sharing the raw data used would allow the comparison and the reproducibility of the figure.
  
  -- Improve the clarity of execution instructions and address potential p-value issues: the issue with p-values showing up as exactly 0 in the reproduced results might be caused by differences in the environment setup, such as missing variables, different software versions, or skipped steps before running the script. Improving the instructions for setting up the environment and running the script would help prevent issues and facilitate reproducibility.
2. GigaScience 05 Jan 2026
  
  in GigaScience
  
  Reviewer 3: Cyril Pernet
  
  The paper describes an alternative way to QC T1w images with 2 major innovations: a different set of metrics not relying on background and a global score that combines those metrics. In addition, all of this is integrated in a well maintained toolbox allowing easy usage.
  
  I only have suggestions (i.e., not all of them have to be done), as the overall paper is well written, easy to follow and the analyses are well conducted.
  
  P6 NCR: it would be nice to demonstrate how it performs compared to traditional CNR (mean of the white matter intensity values minus the mean of the gray matter intensity values, divided by the standard deviation of the values outside the brain). It differs markedly because of the background difference for sure; since you have plenty of test images, you could show that more clearly. (Later in the method, based on what criteria/reason is 'local' defined as 555?)
  
  P7 ECR should capture something similar to the Entropy Focus Criterion; it would be nice to provide a direct comparison.
  
  P8 typo, you meant equation 2.
  
  P8 SIQR: I'm guessing you have experimented with the power function. Maybe add a side note to share your experience of why or how it works better than, e.g., a square.
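The traditional CNR mentioned in the P6 comment follows directly from the reviewer's parenthetical definition. The sketch below assumes that the white matter, gray matter and background intensities have already been extracted as flat lists, and it uses the population standard deviation, which is a choice made here and not stated by the reviewer. The function name and toy intensities are hypothetical.

```python
import math
import statistics


def contrast_to_noise_ratio(white_matter, gray_matter, background):
    """(mean WM - mean GM) / standard deviation of values outside the brain."""
    noise = statistics.pstdev(background)
    if noise == 0:
        raise ValueError("background has zero variance; CNR is undefined")
    contrast = statistics.fmean(white_matter) - statistics.fmean(gray_matter)
    return contrast / noise


# Hypothetical toy intensities, not taken from any dataset in the paper.
wm = [100, 102, 98]
gm = [70, 72, 68]
bg = [0, 2, 4, 2]
cnr = contrast_to_noise_ratio(wm, gm, bg)
```

The dependence on the background term is visible in the denominator: images with a masked or zeroed background leave this measure undefined, which is the situation the background-free NCR is meant to avoid.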
  
  Dr Cyril Pernet
3. GigaScience 05 Jan 2026
  
  in GigaScience
  
  Reviewer 2: Oscar Esteban
  
  Technical Note GIGA-D-25-00085 introduces a segmentation-based quality control (QC) framework for T1-weighted structural MRI integrated into the CAT12 toolbox. The approach defines five interpretable image quality metrics—noise-to-contrast ratio (NCR), inhomogeneity-to-contrast ratio (ICR), resolution score (RES), edge-to-contrast ratio (ECR), and full-brain Euler characteristic (FEC)—which are combined into a composite Structural Image Quality Rating (SIQR). The tool aims to provide a standardized, interpretable scoring system for identifying poor-quality scans, with validation across simulated datasets and real-world imaging data.
  
  Strengths
  
  The manuscript addresses a critical need in neuroimaging by presenting an automated, interpretable, and practical framework for quality control of T1-weighted structural MRI. By integrating multiple segmentation-derived metrics into a single Structural Image Quality Rating (SIQR), the approach enables fast, standardized assessment of image quality. The tool is embedded in the widely used CAT12/SPM ecosystem, facilitating adoption, and it is validated across a range of synthetic and real-world datasets. The scoring system is designed with user accessibility in mind, offering a clear grading scale and robust detection of motion-related artifacts, making it particularly well-suited for use in large-scale research and clinical imaging settings.
  
  Weaknesses
  
  Ambiguity of scope and segmentation dependency. A fundamental issue with the manuscript is its failure to clearly define the proposed QC framework's intended scope. If it is intended as a general-purpose image quality assessment tool, then several limitations become critical: its reliance on accurate tissue segmentation, its omission of background signal, its restricted validation within the CAT12 pipeline, and its lack of demonstrated interoperability with other workflows or populations. The method's reliability across different segmentation tools (e.g., FreeSurfer, FSL, SynthSeg) or in anatomically atypical populations (e.g., pediatric, lesioned brains) is untested. Conversely, if the framework is intended as a CAT12-specific internal QC tool, then the presentation is misleading. The inclusion of cross-tool benchmarks (e.g., MRIQC), the use of generalized grading schemes, and the claims of robustness give the impression of broader applicability. In this narrower interpretation, some concerns (e.g., pipeline generalization) would be less pressing, but others—such as the MRIQC comparison—become more problematic and unjustified. The manuscript would benefit greatly from explicitly stating whether the goal is a broadly applicable QC solution or a targeted add-on for CAT12 workflows.
  
  Lack of compliance with GigaScience reproducibility standards. The manuscript does not currently meet GigaScience's data and code availability requirements. The code used to generate results and figures is not publicly accessible—only available upon request—which directly conflicts with the journal's expectations for open, reproducible research. Similarly, while the data are drawn from public sources, the manuscript lacks direct links, accession numbers, or DOIs for the datasets used, and provides no clarity on data preprocessing or analysis scripts. There is also no reference to licensing for the CAT12 toolbox or the code used in the study, and no reproducibility capsule (e.g., containerized environment, workflow script) is offered. These omissions limit the transparency and reusability of the work and must be addressed to comply with the FAIR principles and GigaScience's editorial policies.
  
  Mischaracterization of background-based IQMs. In the "SIQR measure development" section, the manuscript states: "Image quality measures are commonly estimated from the image background (Mortamed et al., 2008; Esteban et al., 2017)." This statement is factually incorrect and conceptually misleading. First, the citation is incorrect: Mortamed should be Mortamet (2009). Second, it misrepresents tools like MRIQC, where most quality metrics are computed within brain tissue, including CJV, SNR, and contrast-based measures. Third, the authors entirely omit recent work (e.g., Pizarro et al., 2016; Provins et al., 2025) showing that artifacts such as ghosting, wrap-around, and motion often manifest more clearly in the background, due to the nature of Fourier reconstruction. By excluding background regions, the proposed method may miss artifacts that are visible but lie outside the segmented brain, and the trade-offs of this design decision are not discussed. The rationale based on defacing is only partial: defacing typically removes the face, not the broader background, where artifact signals often dominate. The statement as written oversimplifies QC practices and signals a bias toward justifying the framework's internal constraints rather than engaging with the full methodological landscape.
  
  References:
  
  Provins, C., … Esteban, O. (2025). Removing facial features from structural MRI images biases visual quality assessment. PLOS Biology. doi:10.1371/journal.pbio.3003149 (OA).
  
  Pizarro RA, et al. (2016). Automated quality assessment of structural magnetic resonance brain images based on a supervised machine learning algorithm. Front Neuroinf. 10. doi:10.3389/fninf.2016.00052.
  
  Underdeveloped and opaque benchmarking against MRIQC. The benchmarking against MRIQC is reported only in the Results section, with no corresponding description in the Methods. It is surprising that MRIQC is not mentioned by name until page 14, despite the Esteban et al. (2017) reference appearing earlier in a different context. This suggests that the treatment of MRIQC—a widely adopted, general-purpose QC tool—has not been as thorough or fair as would be desirable. Key methodological details are missing: the authors do not explain how MRIQC was executed, how specific features (e.g., snr_wm, cjv) were selected, or whether a multivariate classifier was considered. Given that MRIQC's full model leverages multiple features simultaneously, limiting the comparison to univariate metrics weakens the validity of the claim that SIQR outperforms existing approaches. A more balanced, transparent benchmarking setup would strengthen the manuscript considerably. This benchmarking also mentions an "SPM12-based" QC performance but does not clarify how and why this comparison is made.
  
  No analysis of failure cases. The manuscript does not present examples of false positives or false negatives—cases where SIQR fails to align with visual inspection or known ground truth. Without understanding when and why the metric fails, users cannot judge the risk of misclassification or apply it conservatively in sensitive datasets.
  
  Minor Issues
  
  Figure 7 could benefit from clearer annotation of thresholds and misclassified cases to help interpret the ROC curves.
  
  While the title "The Good, the Bad, and the Ugly" is a play on the classic western film, this informal or humorous reference may be perceived as inappropriate in a scientific context—especially for a methods paper intended to support standardization and reproducibility. The title does not convey the technical scope or scientific contribution of the work, which may undermine its visibility and perceived rigor. A more descriptive and neutral title—e.g., "Segmentation-Based Quality Control of Structural MRI using the CAT12 Toolbox"—would better reflect the content and purpose of the manuscript.
  
  While the authors validate their approach against synthetic degradations and segmentation-derived kappa scores, they do not sufficiently leverage human expert QC ratings. Greater engagement with visual QC standards would make the case for SIQR's practical value more compelling.
  
  I was given access to the supporting data but chose not to proceed with reproducibility checks at this stage, as the manuscript does not currently meet GigaScience's basic standards for code and data transparency. I look forward to reviewing a revised version that clearly defines the scope of the method, improves methodological transparency, and brings the manuscript into compliance with the journal's reproducibility and FAIR data principles.
  
  Best regards,
  
  Oscar Esteban, Ph. D. Research and Teaching FNS Fellow Dept. of Radiology, CHUV, University of Lausanne
4. GigaScience 05 Jan 2026
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf146), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Chris Foulon
  
  The article presents a valuable effort towards standardising quality control methods and their evaluation. However, too many choices seem arbitrary without sufficient justification, and too many sections are unclear. Overall, the quality of the work cannot be fully assessed in the current state of the manuscript, and major revisions are needed to correct that. There is also not enough comparison (one) with other methods and no way of evaluating whether these measures are relevant to actual downstream imaging uses. Additionally, the article's goal is highly unclear and led me to think the segmentation measures were part of the QC pipeline until I read the discussion ... Nothing until the discussion explains that the segmentation measures are used to evaluate the single SIQR score output of the QC pipeline.
  
  Comments: "All measures and tools are part of the Computational Anatomy Toolbox (CAT; https://neuro-jena.github.io//cat, Gaser et al., 2024) of the Statistical Parametric Mapping (SPM; http://www.fil.ion.ucl.ac.uk/spm, Ashburner et al. 2002) software and also available as a standalone version (https://neuro-jena.github.io/enigma-cat12/#standalone)." I cannot really expect everyone to avoid Matlab tools. Still, Matlab is a drag to the development of scalable tools nowadays (every system admin's nightmare is to have to try to make Matlab tools run on high-performance computing servers).
  
  "such as noise, inhomogeneities, and resolution (Figure 1B)." At this point in the article, it's a bit unclear how that works in Figure 1B.
  
  "It is assessed within optimized cerebrospinal fluid (CSF) and white matter (WM) regions." Then, the NCR relies on the segmentation, right? What if the segmentation fails?
  
  Oh, most of the measures actually rely on the segmentation. Are segmentation errors accounted for in the tool? I am thinking specifically about "abnormal" brains that can be difficult for segmentation algorithms. At least at this point of the article, it's not clear.
  
  "To accommodate various international rating systems, we have adopted a linear percentage and a corresponding (alpha-)numeric scaling." this doesn't match the complexity of the following explanation about the rather arbitrary range. I think a much more international and understandable rating would have been a 0 to 1 range. A 0.5 to 10.5 range is not helping users at all. As the rating is linear, I am struggling to see the added value of this choice.
  
  "Although the BWP does not include the simulation of motion artifacts, these are in general comparable to an increase of noise in the BWP dataset by 2 percentage points." Maybe that should be justified with a reference? "in general" might be a bit light to justify not having a direct measure for something presented as important (motion artefacts) in the introduction and goal of the tool. I think the absence of a noise estimation in the QC ratings should be more thoroughly justified.
  
  "To balance the sensitivity to different quality measures while ensuring that the necessary quality conditions are met, we apply an exponentially weighted averaging approach — similar to the root mean square (RMS) but using the fourth power and fourth root." Why is there no justification or references for these arbitrary choices? Why not the fifth root or tenth root? Why the square root and not an exponential or any other function?
  
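  The fourth-power, fourth-root averaging quoted above is a power (generalized) mean with exponent 4. As a minimal sketch, assuming an unweighted mean (whether the manuscript applies additional weights is not stated in the quote), the following compares exponents 2 and 4 on hypothetical ratings where one measure is much worse than the others. The function name and the rating values are illustrative only.

```python
def power_mean(values, p):
    """Generalized mean: (mean of v**p) ** (1/p). p=2 gives the RMS."""
    if not values:
        raise ValueError("values must not be empty")
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


# Hypothetical ratings (higher = worse): three good measures, one poor one.
ratings = [2.0, 2.0, 2.0, 6.0]

arithmetic = power_mean(ratings, 1)   # plain average
rms = power_mean(ratings, 2)          # root mean square
fourth = power_mean(ratings, 4)       # fourth power, fourth root

print(arithmetic, rms, fourth)
```

  A larger exponent pulls the result toward the worst individual rating, which is the stated intent; the sketch does not by itself justify why 4 should be preferred over any other exponent, which is the point raised in the comment.
  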
  "Sample Normalization for Outlier Detection" It is unclear whether this is systematically applied or not. Is it a separate measure, or is it aggregated into another score? That measure could be relevant in many cases but could also be really bad in some specific cases (for example, historical data where the "ideal" quality would probably be well below standards.
  
  "raw (co-registered)" Well, it is not raw if it's co-registered. I suggest reformulation to avoid confusion with actual raw images.
  
  The "Evaluation Concept and Data" section is very unclear. The need for a training-testing scheme is not explained, and the scheme itself is very arbitrary (choosing odd and even numbered files ordered by filenames). How does that splitting strategy help with generalisation? Why that specific split? Why not another? How do we know that split is not biased? Finally, the selection of 6 scans also seems completely arbitrary. Overall, this section does not provide enough information to justify the seemingly arbitrary choices.
  
  "Of note, obvious subject/scan-specific motion artifacts generally increase the scans' rating for about 1 grade, which corresponds to a decrease of 10 rps (and +0.5 grade / -5 rps for light artifacts), in comparison to the typical rating achieved by the majority of scans of the same protocol." This is incredibly vague! How are readers supposed to evaluate the quality control measures with this information?
  
  Discussion: "as this is more relevant for segmentation and surface reconstruction (Ashburner et al., 2005)." A lot of work has been done in these domains in 20 years; this reference, however solid, is not enough to justify that choice. This might not be relevant with the methods developed in the last 20 years.
  
  "with a power of 4 rather than 2, to place greater emphasis on the more problematic aspects of image quality." Still not enough to justify that choice. The authors failed to convince me that one single score is better than reporting all the measures significantly, as different quality measures will influence different tasks. A very practical example is the fact that the vast majority of acquisitions in clinical settings, the resolution is anisotropic (though less with T1 images nowadays, historical datasets will still have it). This anisotropy is not necessarily an issue for human diagnosis, for example; however, aggregating all the scores in one might hide that a low-quality measurement might not affect the specific downstream task. Coupled with the lack of justification for the factor scalings, this choice of a single score is a significant negative point for the tool.
  
  Data availability: Where can the sources of these specific tools be accessed?

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.02.28.640096v1
Nov 2025
www.biorxiv.org

pyRootHair: Machine Learning Accelerated Software for High-Throughput Phenotyping of Plant Root Hair Traits

2
1. GigaScience 28 Nov 2025
  
  in GigaScience
  
  Abstract: Root hairs play a key role in plant nutrient and water uptake. Historically, root hair traits have been largely quantified manually. As such, this process has been laborious and low-throughput. However, given their importance for plant health and development, high-throughput quantification of root hair morphology could help underpin rapid advances in the genetic understanding of these traits. With recent increases in the accessibility and availability of artificial intelligence (AI) and machine learning techniques, the development of tools to automate plant phenotyping processes has been greatly accelerated. Here, we present pyRootHair, a high-throughput, AI-powered software application to automate root hair trait extraction from images of plant roots grown on agar plates. pyRootHair is capable of batch processing over 600 images per hour without manual input from the end user. In this study, we deploy pyRootHair on a panel of 24 diverse wheat cultivars and uncover a large, previously unresolved amount of variation in many root hair traits. We show that the overall root hair profile falls under two distinct shape categories, and that different root hair traits often correlate with each other. We also demonstrate that pyRootHair can be deployed on a range of plant species, including arabidopsis (Arabidopsis thaliana), brachypodium (Brachypodium distachyon), medicago (Medicago truncatula), oat (Avena sativa), rice (Oryza sativa), teff (Eragrostis tef) and tomato (Solanum lycopersicum). The application of pyRootHair enables users to rapidly screen large numbers of plant germplasm resources for variation in root hair morphology, supporting high-resolution measurements and high-throughput data analysis. This facilitates downstream investigation of the impacts of root hair genetic control and morphological variation on plant performance.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf141), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Nicolas Gaggion
  
  The manuscript "pyRootHair: Machine Learning Accelerated Software for High-Throughput Phenotyping of Plant Root Hair Traits" presents a valuable tool for plant phenotyping in microscopy images. As it stands, my recommendation is to accept after minor revisions.
  
  I found the GitHub repository provides detailed instructions and was straightforward to install and run. The whole process took only a few minutes to execute, which speaks well to the software's accessibility.
  
  To further enhance the clarity, precision, and accessibility of the manuscript, I have several comments.
  
  On the random forest classifier training process, the manuscript states "For this comparison, the RFC was trained on a single input image, and used to perform inference on all subsequent images." However, the repository documentation indicates: "To train a random forest model, you will need to train the model on a single representative example of an image, and a corresponding binary mask of the image." The repository further notes that "You will need to ensure that all the images are relatively consistent in terms of lighting, appearance, root hair morphology, and have the same input dimensions. Should your images vary for these traits, you will need to train separate random forest models for different batches of images."
  
  The manuscript needs clarification on this training process. Please specify whether users are expected to manually segment one of their own images, use the nnUNet model to generate binary segmentation and refine it using annotation tools (such as ilastik), or select one of the images from ones provided by the authors of the manuscript that best matches their new data.
  
  Regarding nnUNet performance, I support the decision not to compare with other models, as nnUNet represents state-of-the-art performance and enables easy training for non-expert users. However, I have several questions: Do you plan to release the training dataset so users can retrain the model by incorporating new manually annotated data? The manuscript would benefit from quantifying segmentation performance by crop type. Measuring performance solely by computing time is insufficient, and quantitative metrics such as Dice scores on test holdout sets or cross-validation results (as performed by the nnUNet model) should be reported.
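  
  To make the requested metric concrete, here is a minimal sketch of the Dice score between a predicted and a reference binary mask, with masks given as flat sequences of 0/1 values. The function name, the convention of returning 1.0 for two empty masks, and the toy masks are assumptions for illustration; this is not pyRootHair or nnUNet code.

```python
def dice_score(pred, truth):
    """Dice = 2 * |pred AND truth| / (|pred| + |truth|) for binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Hypothetical flattened masks: 1 = root hair pixel, 0 = background.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]

score = dice_score(pred, truth)
print(round(score, 3))
```

  Reporting such a score per species on a held-out test set would give the per-crop quantification asked for above.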
  
  The current abstract describes pyRootHair as an "AI-powered software application to automate root hair trait extraction from images of plant roots grown on agar plates." This description needs to clarify that images were obtained via microscopy. Do you have insights on how the trained model performs across different microscope systems?
  
  The manuscript requires additional clarification on the root straightening process using piecewise transformation, as this represents an important step in the measurement procedure. Please specify how this is performed and whether a specific algorithm or function from a library is used for the piecewise affine transformation. For readers who are not computer vision specialists, a figure illustrating the measurement steps (segmentation → skeletonization → straightening → measurement) would be valuable.
  
  Really minor comments: It would be helpful if the demo generated all plots by default, and Random Forest Classifier (RFC) is not included in the abbreviation list.
  
  Overall, this represents solid work that addresses an important need in plant phenotyping research. The suggested clarifications will enhance both the scientific rigor and practical utility of the contribution.
2. GigaScience 28 Nov 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf141), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Wanneng Yang
  
  This paper introduces an artificial intelligence-driven software named pyRootHair, which enables high-throughput automated extraction of root hair traits from plant root images, thereby facilitating rapid analysis of root hair morphological variations in various plants, including wheat. However, the following issues remain:
  
  1) Compared to previously published work, the contributions and innovations of this study are not sufficiently highlighted. For instance, the work by Lu, Wei, Xiaochan Wang, and Wei Jia, titled "Root hair image processing based on deep learning and prior knowledge" (Comput. Electron. Agric. 202, 2022: 107397), should be explicitly referenced to clarify the advancements presented here.
  
  2) Although the study demonstrates that pyRootHair can be applied to multiple plant species, including Arabidopsis, Brachypodium, rice, and tomato, the primary validation and analysis are conducted on wheat. For other species, only segmentation results and trait extraction figures are presented, lacking detailed comparative validation with manual measurements as thoroughly as for wheat.
  
  3) The process of "straightening" curved roots is implemented, but the potential introduction of new errors by this procedure is not discussed.
  
  4) In the trait validation section, the correlation analysis between automated and manual measurements shows strong agreement for root hair length and root length, but weaker correlation for elongation zone length. The study should provide a more in-depth discussion on the possible reasons for this lower correlation.
  
  5) The details of the core algorithms (CNN architecture, random forest classifier) are insufficiently described. Key aspects such as parameter selection, optimization, training procedures, and the division ratios of the training/validation/test sets are not clearly specified. Additionally, the specific strategies for data augmentation are not mentioned.
  
  6) No quantitative comparisons with similar tools (e.g., in terms of speed and accuracy) are provided.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.07.14.664697v1
www.biorxiv.org

RNA-SeqEZPZ: A Point-and-Click Pipeline for Comprehensive Transcriptomics Analysis with Interactive Visualizations

2
1. GigaScience 28 Nov 2025
  
  in GigaScience
  
  RNA-Seq analysis has become a routine task in numerous genomic research labs, driven by the reduced cost of bulk RNA sequencing experiments. These generate billions of reads that require accurate, efficient, effective, and reproducible analysis. But the time required for comprehensive analysis remains a bottleneck. Many labs rely on in-house scripts, making standardization and reproducibility challenging. To address this, we developed RNA-SeqEZPZ, an automated pipeline with a user-friendly point-and-click interface, enabling rigorous and reproducible RNA-Seq analysis without requiring programming or bioinformatics expertise. For advanced users, the pipeline can also be executed from the command line, allowing customization of steps to suit specific requirements. This pipeline includes multiple steps from quality control, alignment, filtering, read counting to differential expression and pathway analysis. We offer two different implementations of the pipeline using either (1) bash and SLURM or (2) Nextflow. The two implementation options allow for straightforward installation, making it easy for individuals familiar with either language to modify and/or run the pipeline across various computing environments. RNA-SeqEZPZ provides an interactive visualization tool using R shiny to easily select the FASTQ files for analysis and compare differentially expressed genes and their functions across experimental conditions. The tools required by the pipeline are packaged into a Singularity image for ease of installation and to ensure replicability. Finally, the pipeline performs a thorough statistical analysis and provides an option to perform batch adjustment to minimize effects of noise due to technical variations across replicates. RNA-SeqEZPZ is freely available and can be downloaded from https://github.com/cxtaslim/RNA-SeqEZPZ.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf133), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Yang Yang
  
  The manuscript describes RNA-SeqEZPZ, an automated RNA-Seq analysis pipeline with a user-friendly point-and-click interface. It aims to make comprehensive transcriptomics analyses more accessible to researchers who lack extensive bioinformatics skills by addressing common issues with standardization and usability that arise from using in-house scripts. The pipeline's main features are the use of a Singularity container to simplify software installation and a Nextflow version to support scalability across different computing environments like clouds and clusters. However, I'm not sure if this manuscript fits the journal's scope in its current form. It seems to be just an integration of existing tools without offering new methods or findings.
  
  Major comments:
  
  The manuscript mentions several existing RNA-Seq pipelines, such as ENCODE, nf-core, ROGUE, Shiny-Seq, bulkAnalyseR, Partek™ flow, RaNA-Seq, and RASflow. A more detailed comparison of RNA-SeqEZPZ with these tools is needed, especially regarding specific features, performance metrics, and ease of use. For example, it would be helpful to compare the computational resources required by each pipeline or the statistical methods used for differential expression analysis.
  
  The manuscript emphasizes reproducibility through Singularity containers and Nextflow. However, it would be stronger if it included a more rigorous demonstration of reproducibility. This could involve running the pipeline on multiple datasets and comparing the results, or providing a detailed protocol for other researchers to reproduce the findings.
  
  The manuscript highlights the scalability and portability of RNA-SeqEZPZ due to its Nextflow version. It would be useful to include specific examples of how the pipeline has been used in different computing environments (e.g., cloud, cluster) and to provide performance data to demonstrate its scalability.
  
  The point-and-click interface is a key feature, but the manuscript could benefit from a more detailed description of the interface and its functionalities. Including screenshots or a video demonstration would be valuable for potential users.
  
  The manuscript shows the effects of batch adjustment using a public dataset. It would be beneficial to expand this section with a discussion of the limitations of batch adjustment methods and to provide guidance on when and how to apply them.
2. GigaScience 28 Nov 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf133), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Unitsa Sangket
  
  This research presents a well-designed and powerful program for comprehensive transcriptomics analysis with interactive visualizations. The tool is conceptually strong and user-friendly, requiring only raw reads in FASTQ format to initiate the analysis, with no need for manual quality checks. However, a limitation is that the software must be installed manually, which typically requires access to a high-performance computing (HPC) system and support from a system administrator for installation and server maintenance. As such, non-technical users may find it difficult to install and operate the program independently.
  
  With appropriate revisions based on the comments below, the manuscript has the potential to be significantly improved.
  
  Page 8, lines 158-160: "DESeq2 was selected based on findings by Rapaport et al. (2013)40, which demonstrated its superior specificity and sensitivity as well as good control of false positive errors." The findings in the paper titled "bestDEG: a web-based application automatically combines various tools to precisely predict differentially expressed genes (DEGs) from RNA-Seq data" (https://peerj.com/articles/14344) show that DESeq2 achieves higher sensitivity than other tools when applied to newer human RNA-Seq datasets. This finding should be included in the manuscript. For example: "DESeq2 was selected based on findings by Rapaport et al. (2013)⁴⁰, which demonstrated its superior specificity and sensitivity as well as good control of false positive errors. Additionally, recent findings from the bestDEG study (cite bestDEG) further support the higher sensitivity of DESeq2 relative to other tools when applied to newer human RNA-Seq datasets."
  
  Page 6, lines 124-125: "Raw reads quality control are then performed using FASTQC18 and QC reports are compiled using MultiQC19." The quality of the trimmed reads can be assessed using FastQC, as demonstrated and summarized in the paper titled "VOE: automated analysis of variant epitopes of SARS-CoV-2 for the development of diagnostic tests or vaccines for COVID-19" (https://peerj.com/articles/17504/; page 4, last paragraph): "(1) Per base sequence quality (median value of each base greater than 25), (2) per sequence quality (median quality greater than 27), (3) per base N content (N base less than 5% at each read position) and (4) adapter content (adapter sequences at each position less than 5% of all reads)". This point should be mentioned in the manuscript, including the cutoff value for each FastQC metric used in RNA-SeqEZPZ, as these thresholds may vary. For example: "The quality of the trimmed FASTQ reads was assessed based on the four FastQC metrics, as summarized by Lee et al. (2024). The cutoffs for RNA-SeqEZPZ were set as follows: the median value of each base must be greater than [x], the median quality score must be above [y], the percentage of N bases at each read position must be less than [z]%, and the proportion of adapter sequences at each position must be below [xx]% of all reads."
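  The four cutoffs quoted above can be expressed as a small pass/fail check. The following is a minimal sketch, assuming the per-position values have already been extracted from a FastQC report; the class name `QcThresholds`, the function `failed_checks`, and the input layout are hypothetical, and the default values are the ones quoted from the VOE paper, not cutoffs confirmed for RNA-SeqEZPZ.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class QcThresholds:
    # Defaults are the cutoffs quoted from the VOE paper (assumption:
    # RNA-SeqEZPZ may use different values).
    min_per_base_median_quality: float = 25.0
    min_per_sequence_median_quality: float = 27.0
    max_n_percent: float = 5.0
    max_adapter_percent: float = 5.0


def failed_checks(
    per_base_median_quality: Sequence[float],
    per_sequence_median_quality: float,
    n_percent_by_position: Sequence[float],
    adapter_percent_by_position: Sequence[float],
    thresholds: QcThresholds = QcThresholds(),
) -> List[str]:
    """Return the names of the checks that fail (empty list means all pass)."""
    failures = []
    # (1) Every base position must have a median quality strictly above the cutoff.
    if min(per_base_median_quality) <= thresholds.min_per_base_median_quality:
        failures.append("per_base_quality")
    # (2) The median per-sequence quality must be strictly above the cutoff.
    if per_sequence_median_quality <= thresholds.min_per_sequence_median_quality:
        failures.append("per_sequence_quality")
    # (3) The N content must stay below the cutoff at every read position.
    if max(n_percent_by_position) >= thresholds.max_n_percent:
        failures.append("n_content")
    # (4) The adapter content must stay below the cutoff at every read position.
    if max(adapter_percent_by_position) >= thresholds.max_adapter_percent:
        failures.append("adapter_content")
    return failures


if __name__ == "__main__":
    print(failed_checks([30, 32, 28], 30, [0.0, 0.1], [0.2, 1.0]))
```

  Keeping the cutoffs in one object makes it straightforward to report them in a manuscript and to override them per dataset.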
  
  The programs used for counts table creation and alignment process should be mentioned in the manuscript.
  
  The default cutoffs for FDR and log₂ fold change, as well as instructions on how to modify these thresholds, should be clearly stated in the manuscript.
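  To illustrate what such thresholds control, here is a minimal sketch of filtering a differential expression table by adjusted p-value and absolute log₂ fold change. The column names `padj` and `log2FoldChange` follow the DESeq2 results table; the function name `filter_significant` and the cutoff values 0.05 and 1.0 are assumptions for illustration, not the documented defaults of RNA-SeqEZPZ.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def filter_significant(
    results: Dict[str, Row],
    fdr_cutoff: float = 0.05,      # assumed value, not the pipeline's default
    log2fc_cutoff: float = 1.0,    # assumed value, not the pipeline's default
) -> List[str]:
    """Return gene IDs with padj < fdr_cutoff and |log2FoldChange| >= log2fc_cutoff."""
    significant = []
    for gene, row in results.items():
        padj = row.get("padj")
        lfc = row.get("log2FoldChange")
        # Genes without an adjusted p-value or fold change are skipped.
        if padj is None or lfc is None:
            continue
        if padj < fdr_cutoff and abs(lfc) >= log2fc_cutoff:
            significant.append(gene)
    return significant


if __name__ == "__main__":
    table = {
        "geneA": {"padj": 0.001, "log2FoldChange": 2.3},
        "geneB": {"padj": 0.20, "log2FoldChange": 3.0},
        "geneC": {"padj": 0.01, "log2FoldChange": -0.4},
        "geneD": {"padj": None, "log2FoldChange": 1.5},
        "geneE": {"padj": 0.04, "log2FoldChange": -1.2},
    }
    print(filter_significant(table))
```

  Exposing both cutoffs as parameters is the kind of user-adjustable setting the comment above asks the manuscript to document.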

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.12.20.629844v1
www.biorxiv.org www.biorxiv.org

EMImR: a Shiny Application for Identifying Transcriptomic and Epigenomic Changes

2
1. GigaScience 28 Nov 2025
  
  in GigaByte
  
  Editors Assessment:
  
  Coded and written up as part of the African Society for Bioinformatics and Computational Biology (ASBCB) Omicscodeathons, EMImR is a novel Shiny application for identifying and correlating transcriptomic and epigenomic changes, built from a combination of Bioconductor and CRAN packages. Case studies on publicly available GEO data, corresponding to sequencing data of human blood cell samples from multiple sclerosis patients, demonstrate how the tool works. Documentation and videos are also provided. Peer review and the study highlight the usefulness of the developed tool for analyzing transcriptomic and epigenomic data.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 28 Nov 2025
  
  in GigaByte
  
  Abstract: Identifying differentially expressed genes associated with genetic pathologies is crucial to understanding the biological differences between healthy and diseased states and identifying potential biomarkers and therapeutic targets. However, gene expression profiles are controlled by various mechanisms including epigenomic changes, such as DNA methylation, histone modifications, and interfering microRNA silencing. We developed a novel Shiny application for transcriptomic and epigenomic change identification and correlation using a combination of Bioconductor and CRAN packages. The developed package, named EMImR, is a user-friendly tool with an easy-to-use graphical user interface to identify differentially expressed genes, differentially methylated genes, and differentially expressed interfering miRNA. In addition, it identifies the correlation between transcriptomic and epigenomic modifications and performs the ontology analysis of genes of interest. The developed tool could be used to study the regulatory effects of epigenetic factors. The application is publicly available in the GitHub repository (https://github.com/omicscodeathon/emimr).
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.168), and the reviews have been published under the same license.
  
  Reviewer 1. Haikuo Li
  
  Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? No. Should be made more clear.
  
  Comments: The authors developed EMImR as an R toolkit and open-source software for the analysis of bulk RNA-seq as well as epigenomic sequencing data, including DNA methylation sequencing and non-coding RNA profiling. This work is very interesting and should be of interest to people who work on transcriptomic and epigenomic data analysis but lack a computational background. I have two major comments:
  1. Results presented in this manuscript were only from microarray datasets and are kind of “old” data. Although these data types and sequencing platforms are still very valuable, I don’t think they are widely used as of today, and therefore, it may be less compelling to the audience. It is suggested to validate EMImR using additional, more recently published datasets.
  2. The authors studied bulk transcriptomic and epigenomic sequencing data. In fact, single-cell and spatially resolved profiling of these modalities are becoming the mainstream of biomedical research, since those methods offer much better resolution and biological insights. The authors are encouraged to discuss some key references in this field (for example, PMIDs: 34062119 and 38513647 for single-cell multiomics; PMID: 40119005 for spatial multiomics sequencing), potentially as the future direction of package development.
  
  Re-review: The authors have answered my questions and added new content in the Discussion section as suggested.
  
  Reviewer 2. Weiming He
  
  Dear Editor-in-Chief, The EMImR developed by the author is a Shiny application designed for the identification of transcriptomic and epigenomic changes and data association. This program is mainly targeted at Windows UI users who do not possess extensive computational skills. Its core function is to identify the intersections between genetic and epigenetic modifications.
  
  Review Recommendation: My recommendation is “Minor Revision”; the article can be accepted after appropriate revisions. However, the author needs to address the following issues.
  
  Major Issue The article does not provide specific information on the resource consumption (memory and time) of the program. This is crucial for new users. Although we assume that the resource consumption is minimal, users need to know the machine configuration required to run the program. Therefore, I suggest adding two columns for “Time” and “Memory” in Table 1.
  
  Minor Issues
  1. GitHub Page: The Table of Contents on the GitHub page provides a Demonstration Video. However, because access to YouTube is restricted in some regions, it is recommended to also upload a manual in PDF format named “EMImR_manual.pdf” to GitHub. Step 4 of the Installation Guide states that “All dependencies will be installed automaticly”. It is advisable to add a step: if the installation fails, report the specific error location to the user and guide them to install the dependent packages manually first to ensure successful installation. Currently, the command “source(‘Dependencies_emimr.R’)” does not return any error messages, which is extremely inconvenient for novice users. The author could also provide the maintainer's email address so that users can seek timely help when they encounter problems.
  
  2. R Version: The author recommends using R 4.2.1 (2022), which was released three years ago. The current latest version is R 4.5.1. It is suggested that the author test the program with the latest version to ensure its adaptability to future developments.
  
  3. Flowchart Suggestion: It is recommended to add a flowchart to illustrate the sequential relationships among packages such as DESeq2 for differential analysis, clusterProfiler for clustering, enrichplot for plotting, and miRNA-related packages (this is optional).
  
  4. Function Addition: Currently, the program seems to lack a button for saving PDFs, as well as functions for batch uploading, saving sessions, and one-click exporting of PDF/PNG files. It is recommended to add the “shinysaver” and “downloadHandler” functions to fulfill these requirements.
  
  5. Personalized Features and Upgrade Plan: To attract more users, more personalized features should be added. The author can mention the future upgrade plan in the discussion section. For example, DESeq2 is currently used for differential analysis, and in future upgrades more methods such as PoissonDis, NOISeq, and EBSeq could be provided for users to choose from.
  
  6. Text Polishing Suggestions:
  6.1 Unify the usage of “down - regulated” and “downregulated”, preferably using the latter.
  6.2 “R - studio version” -> “RStudio”
  6.3 Lumian, -> Lumian
  6.4 no login wall -> does not require user registration
  6.5 Rewrite “genes were simultaneously differentially expressed and methylated” as “genes that were both differentially expressed and differentially methylated”.
  6.6 Ensure that Latin names of species are in italics.
  6.7 Make corresponding modifications to other sentences to improve the accuracy and professionalism of the language in the article.
  
  The above are my detailed review comments on this article. I hope they can provide a reference for your decision-making.

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.10.16.682862v1
Oct 2025
www.biorxiv.org www.biorxiv.org

TinkerHap - A Novel Read-Based Phasing Algorithm with Integrated Multi-Method Support for Enhanced Accuracy

3
1. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Abstract: Phasing, the assignment of alleles to their respective parental chromosomes, is fundamental to studying genetic variation and identifying disease-causing variants. Traditional approaches, including statistical, pedigree-based, and read-based phasing, face challenges such as limited accuracy for rare variants, reliance on external reference panels, and constraints in regions with sparse genetic variation. To address these limitations, we developed TinkerHap, a novel and unique phasing algorithm that integrates a read-based phaser, based on a pairwise distance-based unsupervised classification, with external phased data, such as statistical or pedigree phasing. We evaluated TinkerHap’s performance against other phasing algorithms using 1,040 parent-offspring trios from the UK Biobank (Illumina short-reads) and GIAB Ashkenazi trio (PacBio long-reads). TinkerHap’s read-based phaser alone achieved higher phasing accuracies than all other algorithms with 95.1% for short-reads (second best: 94.8%) and 97.5% for long-reads (second best: 95.5%). Its hybrid approach further enhanced short-read performance to 96.3% accuracy and was able to phase 99.5% of all heterozygous sites. TinkerHap also extended haplotype block sizes to a median of 79,449 base-pairs for long-reads (second best: 68,303 bp) and demonstrated higher accuracy for both SNPs and indels. This combination of a robust read-based algorithm and hybrid strategy makes TinkerHap a uniquely powerful tool for genomic analyses.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf138), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3: Julia Markowski
  
  In the presented Technical Note "TinkerHap - A Novel Read-Based Phasing Algorithm with Integrated Multi-Method Support for Enhanced Accuracy" by Hartmann et al., the authors introduce TinkerHap, a new hybrid phasing tool that primarily relies on read-based phasing for both short- and long-read sequencing data, but can additionally incorporate externally phased haplotypes, enabling it to build upon phase information derived from existing statistical or pedigree-based phasing approaches. This hybrid approach addresses an important and timely challenge in the field: integrating the complementary strengths of different phasing strategies to improve the accuracy and span of haplotype blocks, particularly for rare variants, or in variant-sparse genomic regions. The authors clearly articulate the limitations of existing approaches and present their solution in a manner that is both elegant and accessible. Design features such as multiple output formats and compatibility with third-party tools demonstrate a practical awareness of user needs. The authors evaluate TinkerHap using both short-read and long-read state-of-the-art benchmarking datasets, and compare its performance against commonly used phasing tools, demonstrating improvements in both phasing accuracy and haplotype block lengths. Overall, this is a well-conceived and thoughtfully implemented contribution to the phasing community.
  
  While the manuscript is overall well written, there are a few areas where additional clarification or extension would improve its impact. I recommend the following revisions to help clarify key aspects of the method, enhance the generalizability of the evaluation, and align the manuscript more closely with journal guidelines.
  
  Major Comments
  * (1) Limited scope of benchmarking: The evaluation on the highly polymorphic MHC class II region is appropriate for highlighting TinkerHap's strengths in phasing rare variants in variable regions. However, the current evaluation of short-read based phasing is based on a ∼700 kb region selected for its high variant density, which limits the generalizability of the findings. Since the manuscript emphasizes improved performance in regions with sparse genetic variation, it would strengthen the work to include chromosome-wide or genome-wide benchmarks, particularly on short-read data. This would also provide a more balanced comparison with tools like SHAPEIT5, which predictably underperform in the MHC class II region due to their reliance on population allele frequencies and linkage disequilibrium patterns that are less effective for rare or private variants.
  * (2) Coverage and scalability: The manuscript describes TinkerHap as scalable, but since the algorithm relies on overlapping reads, it is unclear how its performance varies with sequencing depth. Including a figure or supplementary analysis showing phasing accuracy, runtime, and memory usage at different coverage levels (particularly for short-read data) would help support this claim and guide users on appropriate coverage requirements.
  * (3) Clarify algorithmic novelty: It would be helpful to elaborate on how TinkerHap's read-based phasing algorithm differs from existing approaches such as the weighted Minimum Error Correction (wMEC) framework implemented in WhatsHap. For example, what specifically enables TinkerHap's read-based mode to produce longer haplotype blocks than other read-based tools?
  * (4) Data description: A brief characterization of the input datasets, such as the sequencing depth, as well as the number and average genomic distance of heterozygous variants in the MHC class II region and the GIAB trio data, would provide important context for interpreting the reported phasing accuracy and haplotype block lengths.
  * (5) Manuscript structure: Since the algorithm itself is the core novel contribution, it should be part of the results section, as should the description of the evaluation currently placed in the discussion. According to GigaScience's Technical Note guidelines, the method section should be reserved for "any additional methods used in the manuscript, that are not part of the new work being described in the manuscript."
  
  Minor Comments
  * (a) Novelty of hybrid approach: While TinkerHap's ability to integrate externally phased haplotypes is valuable, similar functionality exists in other tools; for example, SHAPEIT can accept pre-phased scaffolds (including those generated from read-based phasing), and WhatsHap supports trio-based phasing. Consider refining the language to more precisely describe what is uniquely implemented in TinkerHap's hybrid strategy. It would be interesting to see how the presented results of using SHAPEIT's phasing output as input for TinkerHap compare to an approach of feeding TinkerHap's read-based phasing results into SHAPEIT.
  * (b) Reference bias claim: The introduction states that read-based phasing is "independent of reference bias." While this approach is generally less susceptible to reference bias than statistical phasing, bias can still arise during the read alignment stage, potentially affecting downstream phasing. This point should be clarified.
  * (c) GIAB datasets: The abstract mentions only the GIAB Ashkenazi trio, but later the Chinese trio is included in the analysis as well. Please clarify whether results are averaged across the two datasets.
  * (d) Tool version citation: Please clarify in the text that the comparison was made using SHAPEIT5, not an earlier version.
  
  Recommendation: Minor Revision. With additional clarification on generalizability and coverage sensitivity, this manuscript will make a valuable contribution to the field.
2. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Reviewer 2: Yilei Fu
  
  TinkerHap is a read-based phasing algorithm designed to accurately assign alleles to parental haplotypes using sequencing reads.
  
  General comments:
  1. The manuscript would greatly benefit from the inclusion of a flowchart or schematic overview of the TinkerHap algorithm. Given that the method incorporates multiple components (read-based phasing, pairwise distance-based unsupervised classification, and optional integration with statistical phasing tools like ShapeIT), a visual diagram would help readers grasp the workflow more intuitively.
  
  Major comments:
  1. The authors are missing experiments for long-read based phasing. How does TinkerHap perform with ShapeIT on PacBio long reads? I would suggest that the authors use the same set of phasing methods as in their short-read analysis: TinkerHap+ShapeIT; TinkerHap; WhatsHap; HapCUT2; ShapeIT. I also believe ShapeIT is capable of taking long-read SNV/INDEL calls as VCF.
  2. Following up on point 1, the experimental design of this study is quite skewed. WhatsHap is not suitable for short-read sequencing data, so it does not make sense to apply it to short-read data.
  3. I would advise the authors to read and potentially compare with SAPPHIRE (https://doi.org/10.1371/journal.pgen.1011092). This is a method developed by the ShapeIT team for combining long-read sequencing data with ShapeIT.
  4. To better justify the hybrid strategy, I recommend adding an analysis of sites where TinkerHap and ShapeIT disagree. Are these differences due to reference bias, read coverage, variant type, or true ambiguity? Such an evaluation would help users understand when to rely on the read-based output vs. ShapeIT, and enhance confidence in the merging strategy.
  
  Minor comments:
  1. I could see the versions of the software in the supplementary GitHub repository, but I think it is also important to include those in the manuscript. For example, ShapeIT versions 2 to 5 have quite different functions. The citation for ShapeIT in the manuscript is for ShapeIT 2, but the program that has been used is ShapeIT 5.
  2. The benchmarking hardware used for the runtime comparison needs to be mentioned.
  3. "...a novel and unique phasing algorithm..." -> "...a novel phasing algorithm..."
3. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Reviewer 1: Arang Rhie
  
  The authors present TinkerHap, a tool that accepts a variant call set and read alignment, and assigns heterozygous variants and reads to a particular haplotype based on a greedy pairwise distance-based classification. It accepts a pre-phased VCF as an option to further extend phased blocks. The results look neat, with statistics that make it appear the best among current state-of-the-art read alignment based phasing methods such as HapCUT2 and WhatsHap, and ShapeIT, which uses statistical inference from reference panel data. However, there are several aspects the authors need to address to make their results more compelling.
  1. The benchmarking was only performed on MHC Class II, which is a relatively small and easy-to-phase region owing to its high level of heterozygosity. How do the statistics look when applied to the whole genome? After generating the phased read set, what % of reads can be accurately assigned to the original haplotype at the whole-genome scale? To benchmark the latter, I would recommend doing it on HG002 phased variants and reads by using the HG002Q100 genome (https://github.com/marbl/hg002), i.e. map the classified reads and calculate the coverage and accuracy based on where the reads align to. I would be curious to see what the MHC Class II phased read alignment looks like on the HG002Q100 truth assembly, on each haplotype.
  2. When showing benchmarking results, key features are missing: 1) the number of heterozygous variant sites used for phasing, in addition to the Phased % (what's the denominator here?); 2) the number of phase blocks, phase block NG50, and total length; and 3) the NGx length distribution, shown by plotting the cumulative covered genome length as a function of the longest to shortest phase block.
  3. After phasing the variants (and reads), are the authors able to accurately type the HLA Class II genes? The goal of MHC phasing is to accurately genotype the HLA genes. It is unclear to me why the authors applied their phasing to the 1,040 parent-offspring trios. I agree that it is 'phasable'; however, it is unclear what the motivation here is. The MHC Class II is particularly known to have linked HLA types (e.g., HLA-DRB3 and HLA-DRB5 are inherited together depending on the HLA-DRB1 type, while in some haplotypes HLA-DRB3 is entirely missing), and depending on the HLA types and because the reference incompletely represents this locus, there are multiple tools developed for genotyping this locus. I would be more convinced if the authors could also show the HLA genotyping accuracy based on their phasing method.
  4. Is it possible to use additional data types to further extend the phase blocks, by using datasets such as low coverage PacBio data in addition to the short-read WGS? How about phasing with linked-reads or Hi-C? Both WhatsHap and HapCUT2 are specifically designed to combine such short- and long-range datasets, which is an advantage of using such tools.
  5. The authors claim their method is free from reference bias, with which I strongly disagree. Using a BAM file aligned to a reference inherently has the issue of mapping biases, so any such tools are limited by the reads that align incorrectly. Repeats, especially copy number variable regions with collapses in the reference, are very difficult to accurately phase. Any large structural variant not properly represented in the reference will cause problems due to unmapped reads.
  6. In Methods, 2nd section: I would suggest using allele 1 and allele 2 instead of 'reference' and 'alternative' in the equation and the code. This will increase the number of heterozygous 'phasable' variants, by including those that do not carry any reference allele.
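  The phase block NG50 and NGx distribution requested in point 2 can be computed from a list of phase block lengths. The following is a minimal sketch using the standard NGx definition (the length of the block at which the cumulative length of blocks, sorted longest first, reaches x% of the genome or region length); the function names `ngx` and `ngx_curve` and the toy numbers are hypothetical and are not taken from TinkerHap.

```python
from typing import List, Sequence, Tuple


def ngx(block_lengths: Sequence[int], genome_length: int, x: float = 50.0) -> int:
    """Return NGx for the given phase blocks, or 0 if they never cover x% of the genome."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    target = genome_length * x / 100.0
    covered = 0
    # Walk blocks from longest to shortest until the target coverage is reached.
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0


def ngx_curve(block_lengths: Sequence[int], genome_length: int) -> List[Tuple[float, int]]:
    """Return (cumulative fraction of genome covered, block length) pairs, longest block first."""
    points = []
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        points.append((covered / genome_length, length))
    return points


if __name__ == "__main__":
    blocks = [50, 40, 30, 20, 10]   # toy phase block lengths in bp
    region = 200                    # toy region length in bp
    print("NG50:", ngx(blocks, region))
    print("curve:", ngx_curve(blocks, region))
```

  Plotting the pairs returned by `ngx_curve` gives the cumulative coverage plot described in the comment, and normalizing by the genome length rather than the total phased length keeps tools with different phased fractions comparable.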

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.02.16.638517v1
www.biorxiv.org www.biorxiv.org

Ultra-deep long-read metagenomics captures diverse taxonomic and biosynthetic potential of soil microbes

2
1. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Abstract. Background: Soil ecosystems have long been recognized as hotspots of microbial diversity, but most estimates of their complexity remain speculative, relying on limited data and extrapolation from shallow sequencing. Here, we revisit this question using one of the deepest metagenomic sequencing efforts to date, applying 148 Gbp of Nanopore long-read and 122 Gbp of Illumina short-read data to a single forest soil sample. Results: Our hybrid assembly reconstructed 837 metagenome-assembled genomes (MAGs), including 466 high- and medium-quality genomes, nearly all lacking close relatives among cultivated taxa. Rarefaction and k-mer analyses reveal that, even at this depth, we capture only a fraction of the extant diversity: nonparametric models project that over 10 Tbp would be required to approach saturation. These findings offer a quantitative, technology-enabled update to long-standing diversity estimates and demonstrate that conventional metagenomic sequencing efforts likely miss the majority of microbial and biosynthetic potential in soil. We further identify over 11,000 biosynthetic gene clusters (BGCs), >99% of which have no match in current databases, underscoring the breadth of unexplored metabolic capacity. Conclusions: Taken together, our results emphasize both the power and the present limitations of metagenomics in resolving natural microbial complexity, and they provide a new baseline for evaluating future advances in microbial genome recovery, taxonomic classification, and natural product discovery.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf135), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Ameet Pinto
  
  The manuscript provides long-read mock community datasets from GridION and PromethION sequencing platforms along with draft genomes of mock community organisms sequenced on the Illumina Platform. The entire dataset is available for reuse by the research community and this is an extremely valuable resource that the authors have made available. While there are some analyses of the data included in the current manuscript, it is largely limited to summary statistics (which seems appropriate for a Data Note type manuscript) and some analyses of interest to the field (e.g., de novo metagenome assembly). It would have been helpful to have a more detailed evaluation of the de novo assembly and parameter optimization, but this may have been outside the scope of a Data Note type manuscript. I have some minor comments below to improve clarity of the manuscript.
  
  Minor comments: 1. Line 28-29: Would suggest that the authors provide the citation (15) without the statement in parenthesis or revised version of statement in parenthesis.
  
  "DNA extraction protocol" section 2. The last few lines were a little unclear. For instance: "45 ul (Even) and 225ul (Log) of the supernatant retained earlier…" was a bit confusing, possibly because the line "The standard was spun…before removing the supernatant and retaining." seems incomplete. I would suggest that the authors consider posting the entire protocol on protocols.io, as it is quite possible that other groups may want to reproduce the sequencing step for these mock community standards. This would be particularly helpful as the authors suggest that the protocol was modified to increase fragment length.
  
  "Illumina sequencing" section: 3. Suggest that the authors improve clarity in this section by re-structuring this paragraph. For instance, early in paragraph it is stated that the pooled library was sequenced on four lanes on Illumina HiSeq 1500, but later stated that the even community was sequenced on a MiSeq.
  
  "Nanopore sequencing metrics" in results: 4. Table 2, Figure 3a. - please fix this to Figure 1a. 5. Figure 1B: The x-axis is "accuracy" while in this section Figure 1b is referred to as providing "quality scores". Please replace "quality scores" with "accuracy" for consistency. 6. Figure 1C: Please provide a legend mapping colors to "even" and "log". I realize this information is in Figure 1B, but would be helpful for the reader. Finally, there is no significant trend in sequencing speed over time. Considering this, would be easier to remove the Time component and just have a single panel with the GridION and PromethION sequencing speed for both even and log community in the same panel. It would make it easier to compare the different in sequencing speeds visually.
  
  "Illumina sequencing metrics" in results: 7. Table 5 is mentioned before Tables 3 and 4. Please correct this.
  
  "Nanopore mapping statistics" in results: 8. For Figure 2, consider also providing figure for the even community. 9. Further, it would be helpful to get clarity on where the data for Figure 2 is coming from. Is this from mapping of long-reads to mock community draft (I think so) or from the kraken analyses.
  
  "Nanopore metagenome assemblies" in results: 1. It is unclear how the genome completeness was estimated. 2. The consensus accuracy data is provided for all assemblies combined. Would be helpful if there was some discussion on accuracy of assemblies as a function of wtdgb2 parameters tested. There is some discussion of this in the "Discussion section", but would be helpful if this was laid out clearly in the results, with an additional appropriate figure/table.
2. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Reviewer 1: Lachlan Coin
  
  This is a great data resource, and will be invaluable to the community for testing/developing approaches for metagenome assembly. The aims are well described. Aside from a few queries I have below, the conclusions are largely supported by data shown; the manuscript is well written, and there are no statistical tests presented.
  
  Major comments: It seems that species assignment was done in two ways: one by using Kraken on the contigs (with a database of many bacterial/viral/fungal genomes), and the other by mapping the reads directly to the Illumina assemblies of the isolates in the mixture. It would be useful to be clearer in the results which approach was used in reporting the results. E.g. the sentence "We identify the presence of all 10 microbial species in the community, for both even and log samples, in expected proportions(Figure 2)." presumably relates to the analysis just mapping to the draft Illumina assemblies?
  
  Also, it seems a little surprising that there were no false positive identifications of species not present in the mixture. Is this because this analysis is based on mapping to the draft Illumina isolate assemblies only (see previous comment)? Or, if based on Kraken assignment of contigs, perhaps repetitive and/or short contigs were filtered out?
  
  Could the authors present more statistics on the quality of the nanopore metagenomic assemblies, including the presence of misassemblies, any chimeric contigs, CheckM completeness results, indel errors, mismatch errors, etc.?
  
  Also, can the authors confirm that the assemblies were done on the full nanopore dataset (rather than, for example, on each isolate separately after mapping the reads to each isolate draft Illumina assembly)?
  
  The authors write: "For the even community, using wtdgb2 with varying parameter choices, we were able to assemble seven of the bacteria into single contigs." However, this does not seem to be borne out by Figure 3. I could only see 4 species with at least one single contig assembly. Perhaps the authors could spell out which species have a single contig assembly?
  
  Minor Comments:
  
  In the abstract, "even and odd communities" should be "evenly-distributed and log-distributed communities" for clarity (this term is otherwise unclear to a casual reader of the abstract).

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.05.28.656579v1
www.biorxiv.org www.biorxiv.org

EssSubgraph improves performance and generalizability of mammalian essential gene prediction with large networks

2
1. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Abstract. Predicting essential genes is important for understanding the minimal genetic requirements of organisms, identifying disease-associated genes, and discovering potential drug targets. Wet-lab experiments for identifying essential genes are time-consuming and labor-intensive. Although various machine learning methods have been developed for essential gene prediction, both systematic testing with large collections of gene knockout data and rigorous benchmarking for efficient methods are very limited to date. Furthermore, current graph-based approaches require learning the entire gene interaction networks, leading to high computational costs, especially for large-scale networks. To address these issues, we propose EssSubgraph, an inductive representation learning method that integrates graph-structured network data with omics features for training graph neural networks. We used comprehensive lists of human essential genes distilled from the latest collection of knockout datasets for benchmarking. When applied to essential gene prediction with multiple types of biological networks, EssSubgraph achieved superior performance compared to existing graph-based and other models. The performance is more stable than other methods with respect to network structure and gene feature perturbations. Because of its inductive nature, EssSubgraph also enables predicting gene functions using dynamical networks with unseen nodes and it is scalable with respect to network sizes. Finally, EssSubgraph has better performance in cross-species essential gene prediction compared to other methods. Our results show that EssSubgraph effectively combines networks and omics data for accurate essential gene identification while maintaining computational efficiency. The source code and datasets used in this study are freely available at https://github.com/wenmm/EssSubgraph.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf136), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Ju Xiang
  
  This paper proposes an inductive graph neural network model, EssSubgraph, for prediction of mammalian essential genes by integrating protein-protein interaction (PPI) networks with multi-omics data. Experimental results demonstrate the performance of the method, with additional validation showing effective cross-species prediction and biological consistency of predicted essential genes through functional enrichment analysis. This work is interesting, but some questions need to be clarified before publication.
  
  (1) The literature review lacks discussion about inductive vs. transductive graph learning approaches. Expanding this background would better contextualize the model's technical contributions.
  
  (2) While PCA dimensions for expression features were optimized (Figure 2A-B), other key hyperparameters like sampling depth (K-hop) deserve similar systematic evaluation to ensure optimal configuration.
  
  (3) What is RuLu? How does the author handle the issue of sample imbalance? Does CONCAT mean that two vectors are connected end-to-end to become a vector? If yes, does it mean that the number of rows of W is set to 1 in order to generate the final prediction output?
  
  (4) How is the sampling of nodes performed in EssSubgraph? The explanation of 'Subgraph' in the method name is not sufficient.
  
  (5) What are 'Edge perturbation' and 'feature perturbations'? How are they performed? What is the performance of the algorithm in this article when only the network structure is used or only gene expression data is used? In other words, on the basis of the network, does adding gene expression data bring performance improvements, and vice versa?
  
  (6) The computational efficiency analysis focuses on memory usage but omits critical metrics like training time and scalability with respect to batch size or sampling strategies. Is it appropriate to directly compare 'Memory efficiency and network scalability'? The same method may require different amounts of memory and computation time when using different encoding technologies.
  
  (7) Minor revisions:
  -- "and can predict identities of genes which can then predict the identities of genes that were either included in the training network or are unseen nodes."
  -- Lines 244-251, "We used the EssSubgraph model mentioned above." The logical relationship here needs to be optimized.
  -- "The model is an inductive deep learning method that generates low-dimensional vector representations for nodes in graphs and can predict identities of genes which can then predict the identities of genes that were either included in the training network or are unseen nodes." It is not clear.
  -- Suggest to supplement statistical data on 'high density'. In terms of existing networks, they generally may not be called high-density.
  -- Placing the perturbation curves of different methods in the same figure is more convenient for comparing the stability of different methods.
2. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Reviewer 1: Yuchi Qiu
  
  Predicting essential genes is critical for identifying disease-associated genes. In this work, the authors propose EssSubgraph to predict essential genes by combining PPI and transcriptome data. EssSubgraph utilizes a GraphSAGE structure with subgraph sampling techniques to produce accurate, efficient, and scalable predictions. The method was tested and compared with multiple GNN-based models on 1) essential gene prediction, 2) predictions with randomly permuted node and edge features, and EssSubgraph shows advanced performance in accuracy, efficiency, and scalability. The authors also performed GO analysis to show the interpretability of EssSubgraph to pick up genes with critical biological functions. Further analysis in predicting unseen genes and cross-species genes exemplified the strong generalizability. Overall, this work developed a novel and advanced GNN-based model with comprehensive studies. However, some clarifications are necessary to improve the paper's readability.
  
  1. The authors may give an overview of the method's motivations. For example, the authors may show the method of DepMap and its limitations, then use this as motivation to describe why EssSubgraph is better. It looks like essential genes are very context specific; the authors may clarify what information is used to define essential genes.
  
  2. The authors may introduce their method's unique features such as graph sampling, and its modifications to GraphSAGE.
  
  3. The GNN model description of EssSubgraph is not clear enough. What kind of graph aggregation is used? Is the aggregation layer coupled with a residual layer, and how many layers are used? What is the structure after all aggregation layers? I recommend creating an illustration of the network architecture showing all these details.
  
  4. Many PPI networks are cell-type- or species-specific. How was this cell-type and species information used in this work?
  
  5. Line 150-152: clarification needed.
  
  6. Line 222, should "learned linear transformation" be "learnable linear layer"?

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.07.21.665218v1
www.biorxiv.org www.biorxiv.org

GFFx: A Rust-based suite of utilities for ultra-fast genomic feature extraction

2
1. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Abstract. Genome annotations are becoming increasingly comprehensive due to the discovery of diverse regulatory elements and transcript variants. However, this improvement in annotation resolution poses major challenges for efficient querying, especially across large genomes and pangenomes. Existing tools often exhibit performance bottlenecks when handling large-scale genome annotation files, particularly for region-based queries and hierarchical model extraction. Here, we present GFFx, a Rust-based toolkit for ultra-fast and scalable genome annotation access. GFFx introduces a compact, model-aware indexing system inspired by binning strategies and leverages Rust’s strengths in execution speed, memory safety, and multithreading. It supports both feature- and region-based extraction with significant improvements in runtime and scalability over existing tools. Distributed via Cargo, GFFx provides a cross-platform command-line interface and a reusable library with a clean API, enabling seamless integration into custom pipelines. Benchmark results demonstrate that GFFx offers substantial speedups and makes a practical, extensible solution for genome annotation workflows.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf124), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Andrew Su
  
  This paper describes GFFx, a new fast and efficient toolkit for working with GFF files. The tool represents a notable advance over the current state of the art, and the manuscript overall is well-written. I have only the following minor suggestions for consideration:
  
  In figure S1 and the corresponding discussion, the authors test GFFx on 4 different GFF annotation databases of differing sizes, and differences in performance are attributed solely to the different dataset sizes. The authors should consider subsetting the largest annotation database (hg38) to more smoothly track how performance and memory use vary with annotation database size, and to confirm there are no organism-specific effects that could underlie the observed differences.
  
  The authors should consider changing the line charts in figures 2 and 3 to bar charts — I think the line implies a linear relationship between the tools along the x-axis that is not intended.
  
  For the purposes of benchmarking, the authors used random sampling to extract subsets of the benchmark datasets (e.g., lines 85 and 107). The authors should confirm that the exact same subsets were used when running each tool.
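  
  One common way to guarantee identical subsets across tools is to draw the sample once from a seeded random number generator and reuse it for every run. The following is a minimal Python sketch of that idea, not the authors' benchmarking code: the function name `sample_feature_ids`, the seed value, and the example feature IDs are all hypothetical.

```python
import random


def sample_feature_ids(feature_ids, k, seed=42):
    """Return a reproducible random subset of k feature IDs (hypothetical helper)."""
    # A private, seeded generator is unaffected by other uses of the global RNG.
    rng = random.Random(seed)
    # Sorting first makes the result independent of the input order.
    return rng.sample(sorted(feature_ids), k)


# Hypothetical feature IDs standing in for entries of an annotation file.
ids = [f"gene{i}" for i in range(1000)]

subset_a = sample_feature_ids(ids, 10)
subset_b = sample_feature_ids(reversed(ids), 10)  # same IDs, different input order
```

  With the same seed, both calls return the same ten IDs, so the sampled list can be written to a file once and passed unchanged to each tool being benchmarked.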
  
  In addition to depositing the code and benchmarks on Github, the authors should also deposit snapshots in an archival data repository (like Zenodo).
2. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Reviewer 1: Xingtan Zhang
  
  The overall research appears comprehensive; however, further attention to the tool's capabilities and methodological rigor would strengthen its validity and broader applicability.
  
  In the "Performance benchmark in annotation indexing" section, the authors utilized genome annotations from four species (Homo sapiens hg38, Pungitius sinensis ceob_ps_1.0, Drosophila melanogaster dm6, and Arabidopsis thaliana tair10.1) as representatives for benchmarking and subsequent analyses. Nevertheless, a robust GFF processing suite should ideally demonstrate reliability across a broader spectrum of genome types, irrespective of their frequency of use. To enhance the generalizability of GFFx and cater to a wider user base, it is recommended that additional genomes—such as those of Triticum aestivum, Mus musculus, and Sus scrofa—be included in the benchmarks. This would better validate the tool's robustness across species with varying genome complexities.
  
  While the 20-kb interval length used in the region-based retrieval benchmarks is biologically relevant, corresponding to typical gene sizes, it does not fully capture the diversity of genomic query scenarios. To comprehensively assess GFFx's performance across diverse genomic contexts, it is suggested that supplementary benchmarks be conducted using interval lengths of 10 kb and 100 kb. This would help validate the tool's robustness across varying interval scales, which is critical for its practical utility in diverse research workflows.
  
  To further broaden the software's applicability, it is recommended to incorporate an additional functionality that enables the extraction of the number of reads covering specific intervals from BAM files based on positional information derived from GFF3 files, thereby facilitating the calculation of sequencing depth. This feature would be analogous to the functionality provided by bedtools coverage, enhancing GFFx's utility in integrating genome annotation data with sequencing read coverage analyses.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.08.08.669426v1
www.biorxiv.org www.biorxiv.org

On the path to reference genomes for all biodiversity: lessons learned and laboratory protocols created in the Sanger Tree of Life core laboratory over the first 2000 species

2
1. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Abstract. Since its inception in 2019, the Tree of Life programme at the Wellcome Sanger Institute has released high-quality, chromosomally-resolved reference genome assemblies for over 2000 species. Tree of Life has at its core multiple teams, each of which is responsible for key components of the ‘genome engine’. One of these teams is the Tree of Life core laboratory, which is responsible for processing tissues across a wide range of species into high quality, high molecular weight DNA and intact RNA, and preparing tissues for Hi-C. Here, we detail the different workflows we have developed to successfully process a wide variety of species, covering plants, fungi, chordates, protists, arthropods, meiofauna and other metazoa. We summarise our success rates and describe how to best apply and combine the suite of current protocols, which are all publicly available at protocols.io.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf119), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Lars Podsiadlowski
  
  The authors provide a profound overview of their aim to generate genome information for a wide range of species in the Tree of Life project. As a scientist with hands-on experience in genome sequencing, I greatly appreciated all the information here, especially the detail on the differences experienced with different taxa, as this is probably the most important lesson here: that there is high variation and strategies must be adapted to that. I am also happy that many of the approaches are also available as detailed online protocols, which really helps a lot in practical work. The selected examples of size profiles also give a good impression of what differences can be expected, e.g. with different extraction methods applied to the same species. Although detailed, I think that the authors provide a lot of relevant information here and would not change that. I also did not spot any errors or flaws in the text.
  
  One thing that might be changed is the title. From first reading it I expected to hear also about assembly strategies, as well as some comparisons and oddities of the yielded genomes. It is great to have the manuscript as it is, but I would like to see it better reflected in the title that the main focus here is on the wet lab part, especially the extraction of good quality DNA/RNA.
  
  I have some issues with the figures: Fig. 7: there is no mention in the legend about the y-axis scale - I assume from the text that it refers to Gigabases? Figs. 8,9, 11-15: It was a bit confusing until I realised the log scale of the numbers. I would prefer to see it not with a log scale, but in a similar way as Fig. 6, with percentages on display, and an accompanying species number somewhere on the side. In the way it is shown now, the failed proportion looks so small and gives a wrong impression. Maybe rethink the colors; I would prefer another color for the Pass ULI, which is more similar in tone to Pass, because at the moment Pass ULI and Fail are similar in tone and brightness and appear as being opposed to the green "pass", while the difference between "fail" and the rest should be more pronounced in my view.
2. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Reviewer 1: Yuan Deng
  
  The manuscript focuses on the entire experimental processes involved in the generation of high-quality genomes and proposes a set of standardized and modular experimental process protocols. The innovation of these protocols is that they can be flexibly combined according to different taxa, tissue types and sample quality, which greatly improves the flexibility and efficiency of the experiment and provides a reference experimental process for researchers in this field to follow. The manuscript also explores the specific challenges and solutions of different taxa in the experimental procedure of sample processing, DNA extraction, shearing, cleaning, Hi-C and RNA extraction, providing valuable guidance for future research. Meanwhile, the manuscript reviews the experimental protocols for the production of genome data of more than 2,000 species, which is in line with the journal's focus on biological big data. Therefore, I consider the subject matter and content of this work appropriate for publication in this journal. I only have some minor requests for revision:
  
  1. Sample processing: (1) Sampling of rare and endangered species: such a large-scale study of the "Tree of Life" is bound to involve some species for which conventional tissues are difficult to obtain; therefore the manuscript may include a section on how to select suitable tissues for subsequent experiments, especially for rare species. And is it possible to provide a prioritized list for tissue selection based on the difficulty of extracting high-quality DNA? (2) Processing and extraction of unconventional tissues: accordingly, it is recommended to add content regarding sample processing and extraction procedures for unconventional tissues, e.g., any particular methods to improve the quality of DNA extraction. (3) The sample contamination problem is often overlooked yet critical: how to reduce sample contamination problems in large-scale sample processing and other experimental processes? How to exclude sample or experimental contamination from data?
  
  2. Analyzing method limitations: while the manuscript mentions some challenges that may be encountered in the processing of samples from various taxa, there is little discussion on the limitations of those experimental methods. It is recommended to expand the content on the limitations of the methods (for example, some methods may not work well for certain types of samples, or some steps may have factors that affect the accuracy of the results), so that readers can have a more comprehensive understanding of the scope of application and potential problems of the methods.
  
  3. The manuscript is currently organized according to the experimental procedures, but some of the more relevant components could probably be consolidated to reduce redundant information and improve the readability. The authors studied the experimental conditions for different taxa in long read sequencing and Hi-C library preparation, but fail to emphasize their relevance in the introduction.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.04.11.648334v1
www.biorxiv.org www.biorxiv.org

Improving the Reliability and Quality of Nextflow Pipelines with nf-test

2
1. GigaScience 30 Oct 2025
  
  in GigaScience
  
  ABSTRACT. The workflow management system Nextflow, together with the nf-core community, forms an essential ecosystem in bioinformatics. However, ensuring the correctness and reliability of large and complex pipelines is challenging, since a unified and automated unit-style testing framework specific to Nextflow is still missing. To provide this crucial component to the community, we developed the testing framework nf-test. It introduces a modular approach that enables pipeline developers to test individual process blocks, workflow patterns and entire pipelines in isolation. nf-test is based on a similar syntax as Nextflow DSL 2 and provides unique features such as snapshot testing and smart testing to save resources by testing only changed modules. We show on different pipelines that these improvements minimize development time, reduce test execution time by up to 80% and enhance software quality by identifying bugs and issues early. Already adopted by dozens of pipelines, nf-test improves the robustness and reliability in pipeline development.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf130), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Katalin Ferenc
  
  1) General assessment of the work.
  
  It is a very nice addition to the scientific community, an important step towards standardizing the development and maintenance of software for bioinformatics pipelines. It is not a trivial task to adapt unit testing concepts to pipelines. nf-test has already been used by the community and has been in a feedback loop with the users. Thus, its usability has been constantly improving, both through the efforts of the developers and additional plugins from the user base, highlighting the ease of contribution to the nf-test software base. The text is well written and easy to follow. However, some concepts could be better described and discussed for the readers.
  
  2) Specific comments for revision:
  
  a) Major comments:

  - The authors should refer to pytest-workflow in the introduction, along with NFTest, as both are used for comparison.

  - Test coverage is helpful to identify which lines are vulnerable to changes. For the calculation of the test coverage in nf-test, indirect tests are considered. Does this mean that if a single integration test is written, then all called modules are considered covered? Please clarify or argue why this is a good strategy.

  - An interesting idea in nf-test is to use snapshot testing for modules, workflows, and pipelines. As the authors mention, this has been used in web development. According to the cited reference, it is especially used for frontend code and has been noted as a quick but fragile way of testing. This is because snapshot testing does not provide insight into the correctness of the code, but only asserts that there was no change. It is beneficial that this test checks for unexpected changes that unit tests might miss. In the "Code reduction through snapshot testing" section, the authors highlight cases when snapshot testing results in failed tests: 1) when there is a change in the code due to a bug, and 2) when default parameters are modified. We understand that snapshot testing in the context of pipeline development is useful in two scenarios:

    1. When the pipeline itself is being refactored, the output of each module should stay the same. In this case, snapshot testing is used to fix the output of the tools, and a failing test highlights that the Nextflow code wrapping the tools is incorrectly integrated (i.e., connected to each other).

    2. Pipeline/module versioning requires knowledge about changes in the underlying tools. In this case, snapshot testing helps because any failure in the tests flags a change. As there is no oracle, one would not know whether a bug was introduced or fixed. However, from the pipeline development perspective, the only thing that matters is that there should be a new version.

    According to our understanding, in any other case, a more traditional approach should be preferred, where there is an oracle that knows the expected file formats, content, or errors. Otherwise, there is a risk of adding many tests that fail unnecessarily, increasing development time. Please add explicit discussion about these scenarios, or other ones based on your insights, highlighting when snapshot testing is applicable/appropriate during pipeline development.

  - Please add a summary of other types of tests (e.g., assertions about file or channel content, verification of tool execution given input data, and error handling checks) that can be run within the nf-test framework.

  b) Minor comments:

  - In the "evaluation and validation" section, the authors describe that they ran tests in nf-core/modules between GitHub versions. Please clarify that these modules were already covered by tests.
  - Table 4 is referenced in the Discussion section. It would be better to move the comparison between tools to the Results section.
  - On page 16, typo: "queuing system"
  - Figure 2 title typo: "nf-tet"
  - Figure 2: please add comments about the time cost of adding tests during development, as it is highlighted in the figure.
  - Page 22 typo: "savings areis calculated"
  - Abstract: "Build on…" should be "Built on…"
  - Shouldn't TM2 linked to M3 be TM3 in Figure 1?
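  The distinction drawn above between a snapshot check and an oracle-based check can be sketched in a few lines of plain Python. This is only an illustration of the concept, not nf-test syntax or its API; `run_module`, `snapshot_digest`, and `oracle_check` are hypothetical names introduced here, and the MD5-of-JSON snapshot is an assumed stand-in for whatever a real framework stores.

```python
import hashlib
import json


def run_module(values):
    """Hypothetical stand-in for a pipeline module: returns a small result record."""
    return {"n": len(values), "mean": sum(values) / len(values)}


def snapshot_digest(result):
    """Serialize deterministically and hash, roughly as a snapshot test might store it."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def snapshot_matches(result, stored_digest):
    """Snapshot check: passes only if the output is unchanged."""
    return snapshot_digest(result) == stored_digest


def oracle_check(result, expected_n, expected_mean, tol=1e-9):
    """Oracle check: compares the output against independently known values."""
    return result["n"] == expected_n and abs(result["mean"] - expected_mean) <= tol


# Record a snapshot from a first run.
baseline = run_module([1.0, 2.0, 3.0])
stored = snapshot_digest(baseline)

# Unchanged behaviour: both checks pass.
same_run = run_module([1.0, 2.0, 3.0])

# Changed behaviour (for example a modified default): the snapshot flags a
# change, but only the oracle says whether the new output is right or wrong.
changed_run = {"n": 3, "mean": 2.5}

print(snapshot_matches(same_run, stored), oracle_check(same_run, 3, 2.0))
print(snapshot_matches(changed_run, stored), oracle_check(changed_run, 3, 2.0))
```

  In this sketch the snapshot check detects that something changed, while the oracle check is the only one that can say the changed output is wrong. That is the trade-off the reviewer asks the authors to discuss.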
2. GigaScience 30 Oct 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf130), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Jose Espinosa-Carrasco
  
  The article presents nf-test, a new modular and automated testing framework designed specifically for Nextflow workflows, a widely used workflow management system in bioinformatics. nf-test aims to help developers improve the reliability and maintainability of complex Nextflow pipelines. The framework includes very useful features such as snapshot testing, which assesses the computational repeatability of the results produced by the execution of a pipeline or its components, and smart testing, which optimises computational resources by executing tests only on the parts of the pipeline that were modified, reducing overall run time. Notably, nf-test can be integrated into CI workflows and has already been adopted by the nf-core community, demonstrating its utility and maturity in real-world scenarios.
  
  General comments:
  
  The manuscript could benefit from reordering some sections to follow a more consistent structure and by removing redundant explanations. I think it would be nice to include one limitation of nf-test, the fact that reproducing previous results does not necessarily imply biological correctness. This point is not entirely clear in the current version of the manuscript (see my comment below). Another aspect that could improve the manuscript is the inclusion of at least one reference or explanation of how nf-test can be applied outside nf-core pipelines, as all the provided examples are currently restricted to nf-core.
  
  Specific comments:
  
  On page 3, the sentence "Thus, maintenance requires substantial time and effort to manually verify that the pipeline continues to produce scientifically valid results" could be more precise. I would argue that identical results across versions do not guarantee scientific validity; they merely confirm consistency with previous outputs. True scientific validity requires comparison against a known ground truth or standard.
  
  On page 4, in the sentence "It is freely available, and extensive documentation is provided on the website", I think it would be nice to include the link to the documentation.
  
  In the "Evaluation and Validation" section (page 8), it would be helpful to briefly state the goal of each evaluated test, as is done with the nf-gwas example. You could include something similar for the nf-core/fetchngs and modules examples (e.g., to assess resource optimization through smart testing). Also, the paragraph references the "--related-tests" option, which could benefit from a short explanation of what it does. Lastly, the order in which the pipelines are presented in this section differs from the order in the Results, which makes the structure a bit confusing.
  
  The sections titled "Unit testing in nf-test", "Test case execution", "Smart testing and parallelization", "Snapshot testing", and "Extensions for bioinformatics" seem more appropriate for the Materials and Methods section, as they describe the design and functionality of nf-test rather than reporting actual results. Please ignore this comment if the current structure follows specific journal formatting requirements that I may not be aware of.
  
  The Snapshot testing discussion in the Results section feels somewhat repetitive with its earlier explanation. Consider combining both discussions or restructuring the content to reduce duplication.
  
  On page 11, the sentence "In these cases, MD5 sums cannot be used and validating the dynamic output content can be time-intensive" is not entirely clear to me. Does it mean that it is time-consuming to implement the test for this kind of file, or that the validation of the files is time-consuming?
  
  On page 12, the sentence "Second, we analyzed the last 500 commits..." is confusing because this is actually the third point in the "Evaluation and Validation" section, as mentioned before. Reordering would improve clarity.
  
  On page 14, the authors state "However, changes (b) and (c) lead to incorrect output results without breaking the pipeline. Thus, these are the worst-case scenarios for a pipeline developer." While this is mostly true, I would also add that a change in parameters may produce different, but not necessarily incorrect, results; some may even be more biologically meaningful. I suggest acknowledging this.
  
  Typos:
  
  In the abstract: "Build on a similar syntax as Nextflow DSL2" should be corrected to "Built on a similar syntax as Nextflow DSL2".
  
  In the legend of Figure 2 (page 19): "nf-tet" should be "nf-test".
  
  In the legend of Table 2: "Time savings areis calculated..." should be "Time savings are calculated..."
  
  Recommendation:
  
  Given the relevance and technical contributions of the manuscript, I recommend its publication after addressing the minor revisions summarized above.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.05.25.595877v1
www.biorxiv.org www.biorxiv.org

CryoDataBot: a pipeline to curate cryoEM datasets for AI-driven structural biology

3
1. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Abstract: Cryogenic electron microscopy (cryoEM) has revolutionized structural biology by enabling atomic-resolution visualization of biomacromolecules. To automate atomic model building from cryoEM maps, artificial intelligence (AI) methods have emerged as powerful tools. Although high-quality, task-specific datasets play a critical role in AI-based modeling, assembling such resources often requires considerable effort and domain expertise. We present CryoDataBot, an automated pipeline that addresses this gap. It streamlines data retrieval, preprocessing, and labeling, with fine-grained quality control and flexible customization, enabling efficient generation of robust datasets. CryoDataBot’s effectiveness is demonstrated through improved training efficiency in U-Net models and rapid, effective retraining of CryoREAD, a widely used RNA modeling tool. By simplifying the workflow and offering customizable quality control, CryoDataBot enables researchers to easily tailor dataset construction to the specific objectives of their models, while ensuring high data quality and reducing manual workload. This flexibility supports a wide range of applications in AI-driven structural biology.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf127), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3: Nabin Giri
  
  The paper presents a flexible, integrated framework for filtering and generating customizable cryo-EM training datasets. It builds upon previously available strategies for preparing cryo-EM datasets for AI-based methods, extending them with a user-friendly interface that allows researchers to enter query parameters, interact directly with the Electron Microscopy Data Bank (EMDB), extract and parse relevant metadata, apply quality control measures, and retrieve associated structural data (cryo-EM maps and atomic models).
  
  While the manuscript improves upon Cryo2StructData and similar data pipelines used in ModelAngelo/DeepTracer, the innovation claim would be strengthened by a deeper technical comparison, for example quantifying the performance impact of each quality control step in isolation. Some filtering and preprocessing concepts (e.g., voxel resampling, redundancy handling) are not entirely new, so a more explicit discussion of how CryoDataBot's implementations differ from prior work and why these differences matter would improve the manuscript. I do not think it is challenging to change the resampling or the grid division parameter in the scripts provided by the Cryo2StructData GitHub repo or the scripts available in the ModelAngelo GitHub repo.
  
  The benchmarking is mainly limited to ribosome datasets. While this choice is understandable for demonstration purposes, the generalizability to other macromolecules (e.g., membrane proteins, small complexes) is not shown. Including a small-scale test on a different class of structures or prediction targets (e.g., protein C-alpha positions, backbone atom positions, or the more difficult amino acid type prediction) could strengthen the claim of broad applicability. Since the technical innovation is limited, this would help to improve the paper.
  
  The authors state that CryoDataBot ensures reproducibility and provides datasets for AI-method benchmarking. However, EMDB entries can be updated over time (e.g., through reprocessing, resolution improvements, model re-fitting, or correction of atomic coordinates). In my opinion, in the strict sense, reproducibility (producing identical datasets) depends on versioning of EMDB/PDB entries. Without version locking, CryoDataBot ensures procedural reproducibility but not data immutability. The manuscript should either explain how reproducibility is maintained (e.g., version control, archived snapshots) or clarify that reproducibility refers to the workflow, not necessarily the exact dataset content, unless versioned datasets are provided, as done in Cryo2StructData.
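  One way to make the difference between procedural reproducibility and data immutability concrete is to record a content hash for every retrieved file and compare manifests between runs. The sketch below is a minimal illustration under stated assumptions: it is not part of CryoDataBot, the function names are hypothetical, and the entry IDs and byte strings are made-up placeholders rather than real EMDB content.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(entries):
    """Map each entry ID to the digest of its downloaded content.

    `entries` is a dict of entry ID -> raw bytes (placeholders here).
    """
    return {entry_id: file_digest(blob) for entry_id, blob in sorted(entries.items())}


def changed_entries(old_manifest, new_manifest):
    """IDs whose content differs, or that were added or removed, between two runs."""
    ids = set(old_manifest) | set(new_manifest)
    return sorted(i for i in ids if old_manifest.get(i) != new_manifest.get(i))


# Placeholder data standing in for two retrievals made at different times.
run_1 = build_manifest({"EMD-0001": b"map bytes v1", "EMD-0002": b"map bytes"})
run_2 = build_manifest({"EMD-0001": b"map bytes v2", "EMD-0002": b"map bytes"})

print(changed_entries(run_1, run_2))
```

  Running the same workflow twice gives procedural reproducibility. An empty result from `changed_entries` would additionally indicate that the underlying data did not change between the two retrievals.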
  
  Some other concerns:

  (1) The "Generating Structural Labels" section is missing technical details. Please provide more information on how the labels are generated, including labeling radius selection, and how ambiguities are resolved if any are encountered. Suggestions on how the user should determine the radius and the grid size (64^3 or other) would be beneficial.

  (2) The manuscript states in the adaptive density normalization part: "This method is more flexible and removes more noise than the fixed-threshold approaches commonly used in prior studies." What do noise and signal mean here? There is a separate body of AI-based work developed for reducing noise, such as DeepEMhancer and EMReady, to name a few. Is there any metric to support this claim?

  (3) The manuscript states: "To assess dataset redundancy, we analyzed structural similarity between entries based on InterPro (IPR) domain annotations." Is this a new approach introduced here, or an established practice? How does it compare with sequence-based similarity measures, or with structure-based similarity measures such as Foldseek?

  (4) The statement "underscoring the dataset's superior quality and informativeness" is strong. Is it possible to provide more concrete, quantitative evidence to support this, ideally beyond the U-Net training metrics?

  (5) Is there a case where there are multiple PDB IDs for one cryo-EM density map? If so, how is a specific atomic model chosen in such a case?
2. GigaScience 30 Oct 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf127), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Dong Si
  
  This paper discusses CryoDataBot, which creates cryoEM datasets for training, with the ability to filter entries based on redundancy, MMF, and other user-defined parameters. Here are some comments:
  
  The data labeling just has helix, sheet, coil, and RNA. The labeling should also consider DNA and other structures.
  
  The introduction of a Volume Overlap Fraction (VOF) score to validate map-model fitness (MMF) is a novel method to assess global alignment. However, VOF relies on summing and binarizing 2D projections which may have limitations. It is not clear how sensitive the VOF score is to the binarization process or how it handles complex, non-globular shapes. The paper would be strengthened if the authors could provide more justification for this specific metric over other global 3D correlation scores. An analysis of specific examples of map-model pairs that were discarded by the VOF score but not by the Q-score would be informative.
  
  The authors acknowledge the trade-off between higher precision and lower recall that results from overly stringent filtering. While increased precision clearly benefits tasks like model refinement, the resulting reduced recall could significantly hinder de novo modeling, which depends on capturing the entirety of a structure, even with lower confidence. This point could be elaborated on. Is this an area for future work, e.g., developing pre-configured filtering settings for various downstream tasks, like a Precision vs. Recall bias setting? This might increase utility based on application.
  
  The retraining of CryoREAD is a practical validation of the pipeline's utility for RNA modeling; however, the experimental dataset used is exclusively from ribosomes. Ribosomes were selected because they contain both protein and RNA and are abundant in the EMDB, but they may not represent the full diversity of RNA structures. The authors rightly note that training set composition affects performance. It would be helpful to further discuss the potential shortcomings of an exclusively ribosome-based training set and the possible impact on the retrained CryoREAD model's use in validating other classes of RNA.
  
  The authors should consider benchmarking on the other SOTA protein-RNA-DNA modeling tools. Right now it is only benchmarked on their own CryoREAD, which is just an RNA/DNA modeling tool.
  
  I tried installing CryoDataBot, and it looks like it requires Python version 3.8 or higher, but this isn't specified anywhere in the paper or on the site.
  
  Many references and citations are misplaced or incorrect.
3. GigaScience 30 Oct 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf127), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Ashwin Dhakal
  
  The authors introduce CryoDataBot, a GUI-driven pipeline for automatically curating cryo-EM map/model pairs into machine learning-ready datasets. The study is timely and addresses a real bottleneck in AI-driven atomic model building. The manuscript is generally well written and includes benchmarking experiments (U-Net and CryoREAD retraining). Nevertheless, several conceptual and presentation issues should be resolved before the work is suitable for publication:
  
  1 All quantitative tests focus on ribosome maps in the 3-4 Å range. Because ribosomes are unusually large and RNA-rich, it is unclear whether the curation criteria (especially Q-score ≥ 0.4 and VOF ≥ 0.82) generalise to smaller or lower-resolution particles. Please include at least one additional macromolecule class (e.g., membrane proteins or spliceosomes) or justify why the current benchmark is sufficient.
  
  2 The manuscript adopts fixed thresholds (Q-score 0.4; 70% similarity; VOF 0.82) yet does not show how sensitive downstream model performance is to these values. A short ablation (e.g., sweeping the Q-score from 0.3 to 0.6) would help readers reuse the tool sensibly.
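  The first half of such an ablation, seeing how many entries survive each cutoff, is cheap to compute before any model is retrained. The sketch below assumes a flat list of per-entry scores; the values are made up for illustration and `sweep_threshold` is a hypothetical helper, not a CryoDataBot function. Measuring the effect on downstream model performance would still require retraining at each cutoff.

```python
def sweep_threshold(scores, thresholds):
    """Return {threshold: number of entries whose score is >= threshold}."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# Made-up per-entry scores, purely illustrative.
q_scores = [0.25, 0.35, 0.42, 0.48, 0.55, 0.61]
thresholds = [0.3, 0.4, 0.5, 0.6]

retained = sweep_threshold(q_scores, thresholds)
for threshold, count in retained.items():
    print(f"cutoff {threshold:.1f}: {count} of {len(q_scores)} entries retained")
```

  Reporting the retained counts next to the model metrics at each cutoff would make it clearer whether a performance change comes from data quality or simply from dataset size.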
  
  3 Table 1 claims CryoDataBot "addresses omissions" of Cryo2StructData, but no quantitative head-to-head benchmarking is provided (e.g., training the same U-Net on Cryo2StructData). Please add such a comparison or temper the claim.
  
  4 For voxel-wise classification, F1 scores are affected by severe class imbalance (Nothing ≫ Helix/Sheet/Coil/RNA). Report per-class support (number of positive voxels) and consider complementary instance-level or backbone trace metrics.
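  A per-class report with support can be produced directly from flattened voxel labels. The following is a minimal sketch using only the standard library; the label names follow the classes mentioned in the comment, while the toy label arrays and the function name `per_class_report` are assumptions made for illustration.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Precision, recall, F1, and support (true-label count) for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    support = Counter(y_true)
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support[label],
        }
    return report


# Toy, heavily imbalanced example: 8 background voxels and 2 helix voxels.
y_true = ["Nothing"] * 8 + ["Helix"] * 2
y_pred = ["Nothing"] * 8 + ["Helix", "Nothing"]

report = per_class_report(y_true, y_pred)
for label, stats in report.items():
    print(label, stats)
```

  In this toy case, overall accuracy is 90% even though half of the helix voxels are missed, which is why per-class recall and support are worth reporting alongside F1.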
  
  5 In Fig. 4 the authors show that poor recall/precision partly stems from erroneous deposited models. Quantify how often this occurs across the 18-map test set and discuss implications for automated QC inside CryoDataBot.
  
  6 The authors note improved precision but slightly reduced recall in CryoDataBot-trained models. This is explained, but strategies to mitigate this tradeoff are not discussed. Could ensemble learning, soft labeling, or multi-resolution data alleviate the recall drop?

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.09.09.675185v1
www.biorxiv.org www.biorxiv.org

Reproducible processing of TCGA regulatory networks

2
1. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Abstract. Background: Technological advances in sequencing and computation have allowed deep exploration of the molecular basis of diseases. Biological networks have proven to be a useful framework for interrogating omics data and modeling regulatory gene and protein interactions. Large collaborative projects, such as The Cancer Genome Atlas (TCGA), have provided a rich resource for building and validating new computational methods resulting in a plethora of open-source software for downloading, pre-processing, and analyzing those data. However, for an end-to-end analysis of regulatory networks a coherent and reusable workflow is essential to integrate all relevant packages into a robust pipeline. Findings: We developed tcga-data-nf, a Nextflow workflow that allows users to reproducibly infer regulatory networks from the thousands of samples in TCGA using a single command. The workflow can be divided into three main steps: multi-omics data, such as RNA-seq and methylation, are downloaded, preprocessed, and lastly used to infer regulatory network models with the netZoo software tools. The workflow is powered by the NetworkDataCompanion R package, a standalone collection of functions for managing, mapping, and filtering TCGA data. Here we show how the pipeline can be used to study the differences between colon cancer subtypes that could be explained by epigenetic mechanisms. Lastly, we provide pre-generated networks for the 10 most common cancer types that can be readily accessed. Conclusions: tcga-data-nf is a complete yet flexible and extensible framework that enables the reproducible inference and analysis of cancer regulatory networks, bridging a gap in the current universe of software tools.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf126), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Jérôme Salignon
  
  This manuscript presents tcga-data-nf, a Nextflow-based pipeline for downloading, preprocessing, and analyzing TCGA multi-omic data, with a focus on gene regulatory network (GRN) inference. The workflow integrates established bioinformatics tools (PANDA, DRAGON, and LIONESS) and adheres to best practices for reproducibility through containerization (Docker, Conda, and Nextflow profiles). The authors demonstrate the utility of their pipeline by applying it to colorectal cancer subtypes, identifying potential regulatory interactions in TGF-β signaling. The manuscript is well-written and well-structured and provides sufficient methodological details, as well as Jupyter notebooks, for reproducibility. However, there are some areas that require clarification and improvement for acceptance in GigaScience, particularly regarding the scope of the tool, the quality of the inferred regulatory networks, the case study figure, benchmarking, statistical validation, and parameters.
  
  Major comments:
  
  While the pipeline is well designed and executed, the overall impact of the tool feels somewhat limited, especially for a journal like GigaScience, due to its fairly specific application to building GRNs from TCGA data, the relatively small number of parameters, the support of only two omics types, and the lack of novel algorithms. To increase the impact of this tool, I would recommend adding functionalities, such as:
  
  o Supporting additional tools. A great strength of the pipeline is the integration with the Network Zoo (NetZoo) ecosystem. However, only three tools are included from NetZoo. Including additional tools would likely increase the scope of users interested in using the pipeline. In particular, an important weakness of the current pipeline is that it is not possible to conduct differential analysis between different networks, which prevents users from identifying the most significant differences between two networks of interest (e.g., CMS2 vs CMS4). The NetZoo contains different tools to conduct such analyses, such as Alpaca [1] or Crane [2]; these could be implemented to make the pipeline more useful to a broader user base.
  
  o Adding parameters. A strength of the pipeline is the ability to customize it using various parameters. However, the pipeline currently does not offer many parameters. It would be beneficial to make the pipeline a bit more customizable. For example, new parameters could include: options for excluding selected samples, different batch correction methods, different methods to map CpGs to genes, additional normalization methods, and additional quality controls (e.g., PCA for methylation samples, md5sum checks). These are just examples and do not all need to be implemented, but adding some extra parameters would help make the pipeline more appealing and customizable to various users.
  
  The quality of the inferred regulatory networks is hard to judge. There are no direct comparisons with any other tools.
  
  o For instance, it is mentioned in the text that GRAND networks were derived using a fixed set of parameters, but it could be helpful to show a direct comparison between GRNs built from your tools with those from GRAND. This could reveal how the ability to customize GRNs using the pipeline's parameters helps in getting better biological insights.
  
  o Alternatively, or in addition, one could compare how networks built by your method fare in comparison to networks built from other methods, like RegEnrich [3] or NetSeekR [4], in terms of biological insights, accuracy, scalability, speed, functionalities and/or memory usage.
  
  o Another angle to judge the regulatory networks would be to check in a case study if the predicted gene interactions between disease and control networks are enriched in disease and gene-gene interactions databases, such as DisGeNet [5].
  
  Figure 2 needs re-work:
  
  o Panel A and C: text is too small. "tf" should be written TF. "oi" should have another name. These panels might be moved to the supplements.
  
  o Panel D is confusing. Without significance it is hard to understand what the point of this panel is. I can see that certain TFs are cited in the main text, but without information about significance, these may seem like cherry-picking. The legend states: "Annotation of all TFs in cluster D (columns) to the Reactome parent term. 'Immune system' and 'Cellular respondes to stimuli' are more consistenly involved in cluster D, in comparison to cluster A." However, this is a key result which should be shown in a main figure, not in Figure S6. I would also recommend using a -log scale when displaying the p-values to highlight the most significant entries.
  
  o Panel E is quite confusing. First, the color coding is unclear: what do the blue, purple, and red colors represent? Second, what do the edges' widths represent? I would recommend using different shapes for the methylation and expression nodes to reduce the number of colors, and adding a color legend. I would also consider merging the two graphs and representing in color the difference in the edge values so the reader can directly see the key differences.
  
  Benchmarking analysis could be included to show the runtime and memory requirement for each pipeline step. It would also be beneficial to analyze a larger dataset than colon cancer to assess the scalability.
  
  Statistical analysis: If computationally feasible, permutation testing could be implemented to quantify the robustness of inferred regulatory interactions. Also, in the method section, it should be clarified that FDR correction was applied for pathway enrichment analysis.
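  To make the permutation-testing suggestion concrete, the sketch below tests whether the mean weight of a single edge differs between two groups of samples by shuffling group labels. It is a generic two-sample permutation test under stated assumptions: the edge weights are made-up numbers, `permutation_p_value` is a hypothetical helper, and it does not re-run PANDA, DRAGON, or LIONESS, which a full robustness analysis of inferred networks might require.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Group labels are shuffled `n_permutations` times; the p-value is the
    fraction of shuffles whose absolute mean difference is at least as
    large as the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Made-up weights of one TF-gene edge in samples from two subtypes.
subtype_1 = [0.90, 1.00, 1.10, 0.95, 1.05]
subtype_2 = [0.10, 0.20, 0.15, 0.05, 0.12]

p_value = permutation_p_value(subtype_1, subtype_2)
print(f"permutation p-value: {p_value:.4f}")
```

  When many edges are tested this way, the resulting p-values would still need multiple-testing correction, in line with the FDR remark above.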
  
  Minor comments:
  
  I am not sure why duplicate samples are discarded in the pipeline. Why not sum the counts for RNA-Seq and average the beta values for methylation? I would expect that to yield more robust results.
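  The alternative raised here, merging duplicates instead of discarding them, can be sketched as follows. The data layout (a list of `(sample_id, values)` pairs), the function names, and the gene, probe, and sample identifiers are all assumptions made for illustration; they do not reflect how tcga-data-nf or NetworkDataCompanion represent samples.

```python
from collections import defaultdict


def sum_counts(replicates):
    """Sum RNA-Seq counts per gene across replicates sharing a sample ID."""
    totals = defaultdict(lambda: defaultdict(int))
    for sample_id, counts in replicates:
        for gene, count in counts.items():
            totals[sample_id][gene] += count
    return {sample: dict(genes) for sample, genes in totals.items()}


def mean_betas(replicates):
    """Average methylation beta values per probe across replicates."""
    sums = defaultdict(lambda: defaultdict(float))
    seen = defaultdict(lambda: defaultdict(int))
    for sample_id, betas in replicates:
        for probe, beta in betas.items():
            sums[sample_id][probe] += beta
            seen[sample_id][probe] += 1
    return {
        sample: {probe: sums[sample][probe] / seen[sample][probe] for probe in sums[sample]}
        for sample in sums
    }


# Placeholder data: sample "P1" was profiled twice, "P2" once.
rna = [
    ("P1", {"TP53": 10, "MYC": 5}),
    ("P1", {"TP53": 4, "MYC": 1}),
    ("P2", {"TP53": 7, "MYC": 2}),
]
methylation = [
    ("P1", {"cg0001": 0.2}),
    ("P1", {"cg0001": 0.4}),
]

merged_counts = sum_counts(rna)
merged_betas = mean_betas(methylation)
print(merged_counts)
print(merged_betas)
```

  Summing counts increases effective sequencing depth for the merged sample, so any downstream normalization would need to account for it.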
  
  It is a bit unclear in what context the NetworkDataCompanion tool could be used outside the workflow. It is also unclear how it helps with quality controls. Please clarify these aspects.
  
  The manuscript is well-written, but words are sometimes missing or misspelled; it needs a careful re-read.
  
  The expression '"same-same"' is unclear to me.
  
  In this sentence: "Some of "same-same" genes (STAT5A, CREB3L1"…, I am not sure in which table or figure I can find this result.
  
  Text is too small in the Directed Acyclic Graph, especially in Figure S4. Also, I would recommend adding the Directed Acyclic Graphs from Figure S1-S4 to the online documentation.
  
  Regarding the code, I was puzzled to see a copyConfigFiles process. Also, there are files in bin/r/local_assets; these should be located in assets. The container for the singularity and docker profiles is likely the same; this should be clarified in the code.
  
  It is recommended to remove the "defaults" channel from the list of channels declared in the containers/conda_envs/analysis.yml file. Please see information about that here https://www.anaconda.com/blog/is-conda-free and here https://www.theregister.com/2024/08/08/anaconda_puts_the_squeeze_on/.
  
  Additional comments (which do not need to be addressed):
  
  Future work may consider enabling the use of the pipeline to build GRNs from other data sources than TCGA (i.e., nf-netzoo). Recount3 data is already being parsed for GTEx and TCGA samples, so it might be relatively easy to adapt the pipeline so that it can be used on any arbitrary recount3 dataset. Similarly, it could be useful if one could specify a dataset on the recountmethylation database 6 to build GRNs. While these unimodal datasets could not be used with the DRAGON method they would still benefit from all other features of the pipeline.
  
  Using an nf-core template would give the code a better structure and increase the visibility of the tool. Also, multiple containers are usually easier to maintain and update than a single large container, especially when a single tool needs to be updated or when part of the pipeline is modified. Finally, the code contains many comments that do not explain the code but read like quick draft notes, which makes the code harder for others to read.
  
  References
  1. Padi, M., and Quackenbush, J. (2018). Detecting phenotype-driven transitions in regulatory network structure. npj Syst Biol Appl 4, 1-12. https://doi.org/10.1038/s41540-018-0052-5.
  2. Lim, J.T., Chen, C., Grant, A.D., and Padi, M. (2021). Generating Ensembles of Gene Regulatory Networks to Assess Robustness of Disease Modules. Front. Genet. 11. https://doi.org/10.3389/fgene.2020.603264.
  3. Tao, W., Radstake, T.R.D.J., and Pandit, A. (2022). RegEnrich gene regulator enrichment analysis reveals a key role of the ETS transcription factor family in interferon signaling. Commun Biol 5, 1-12. https://doi.org/10.1038/s42003-021-02991-5.
  4. Srivastava, H., Ferrell, D., and Popescu, G.V. (2022). NetSeekR: a network analysis pipeline for RNA-Seq time series data. BMC Bioinformatics 23, 54. https://doi.org/10.1186/s12859-021-04554-1.
  5. Hu, Y., Guo, X., Yun, Y., Lu, L., Huang, X., and Jia, S. (2025). DisGeNet: a disease-centric interaction database among diseases and various associated genes. Database 2025, baae122. https://doi.org/10.1093/database/baae122.
  6. Maden, S.K., Walsh, B., Ellrott, K., Hansen, K.D., Thompson, R.F., and Nellore, A. (2023). recountmethylation enables flexible analysis of public blood DNA methylation array data. Bioinformatics Advances 3, vbad020. https://doi.org/10.1093/bioadv/vbad020.
2. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Abstract. Background: Technological advances in sequencing and computation have allowed deep exploration of the molecular basis of diseases. Biological networks have proven to be a useful framework for interrogating omics data and modeling regulatory gene and protein interactions. Large collaborative projects, such as The Cancer Genome Atlas (TCGA), have provided a rich resource for building and validating new computational methods resulting in a plethora of open-source software for downloading, pre-processing, and analyzing those data. However, for an end-to-end analysis of regulatory networks a coherent and reusable workflow is essential to integrate all relevant packages into a robust pipeline. Findings: We developed tcga-data-nf, a Nextflow workflow that allows users to reproducibly infer regulatory networks from the thousands of samples in TCGA using a single command. The workflow can be divided into three main steps: multi-omics data, such as RNA-seq and methylation, are downloaded, preprocessed, and lastly used to infer regulatory network models with the netZoo software tools. The workflow is powered by the NetworkDataCompanion R package, a standalone collection of functions for managing, mapping, and filtering TCGA data. Here we show how the pipeline can be used to study the differences between colon cancer subtypes that could be explained by epigenetic mechanisms. Lastly, we provide pre-generated networks for the 10 most common cancer types that can be readily accessed. Conclusions: tcga-data-nf is a complete yet flexible and extensible framework that enables the reproducible inference and analysis of cancer regulatory networks, bridging a gap in the current universe of software tools.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf126), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Xi Chen
  
  Fanfani et al. present tcga-data-nf, a Nextflow pipeline that streamlines the download, preprocessing, and network inference of TCGA bulk data (gene expression and DNA methylation). Alongside this pipeline, they introduce NetworkDataCompanion (NDC), an R package designed to unify tasks such as sample filtering, identifier mapping, and normalization. By leveraging modern workflow tools (Nextflow, Docker, and conda), they aim to provide a platform that is both reproducible and transparent. The authors illustrate the pipeline's utility with a colon cancer subtype example, showing how multi-omics networks (inferred via PANDA, DRAGON, and LIONESS) may help pinpoint epigenetic factors underlying more aggressive tumor phenotypes. Overall, this work addresses a clear need for standardized approaches in large-scale cancer bioinformatics. While tcga-data-nf promises a valuable resource, the following issues should be addressed more thoroughly before publication:

  1. While PANDA, DRAGON, and LIONESS form a cohesive system, they were all developed by the same research group. To strengthen confidence, please include head-to-head comparisons with other GRN inference methods (e.g., ARACNe, GENIE3, Inferelator). A small benchmark dataset with known ground-truth (or partial experimental validation) would be especially valuable.
  2. Although the manuscript identifies intriguing TFs and pathways, it lacks confirmation through orthogonal data or experiments. If available, consider including ChIP-seq or CRISPR-based evidence to reinforce at least a subset of inferred regulatory interactions. Even an in silico overlap with known TF-binding sites or curated gene sets would help validate the predictions.
  3. PANDA and DRAGON emphasize correlation/partial correlation, so they may overlook nonlinear or combinatorial regulation. If feasible, please provide any preliminary steps taken to capture nonlinearities or discuss approaches that could be integrated into the pipeline.
  4. LIONESS reconstructs a network for each sample in a leave-one-out manner, which can be demanding for large cohorts. The paper does not mention runtime or memory requirements. Adding a Methods subsection with approximate CPU/memory benchmarks (e.g., "On an HPC cluster with X cores, building LIONESS networks for 500 samples took Y hours") is recommended to guide prospective users.
  5. Currently, the pipeline only covers promoter methylation and standard gene expression, yet TCGA and related projects include other data types (e.g., miRNA, proteomics, histone modifications). If possible, offer a brief example or instructions on adding new omics layers, even conceptually.
  6. Recent methods often target single-cell RNA-seq, but tcga-data-nf is geared toward bulk datasets. Please clarify limitations and potential extensions for single-cell or multi-region tumor data. This would help readers understand whether (and how) the pipeline could be adapted to newer high-resolution profiles.

  Minor points:

  1. Provide clear guidance on cutoffs for low-expressed genes, outlier samples, and methylation missing-value imputation.
  2. Consider expanding the supplement with a "quick-start" guide, offering step-by-step usage examples.
  3. Ensure stable version tagging in your GitHub repository so that readers can reproduce the exact pipeline described in the manuscript.
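  To make the cost concern about LIONESS concrete, the sketch below applies the leave-one-out LIONESS formula, e(q) = N * (e_all - e_without_q) + e_without_q, to a single edge. It assumes a Pearson correlation as the aggregate network method, which is only a stand-in for the netZoo implementations, and the function names are hypothetical. The aggregate method is called once on all samples and once per left-out sample, so the work grows with both the cohort size and the cost of one aggregate network.

```python
import math


def pearson(x, y):
    """Pearson correlation, used here as a stand-in aggregate edge score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lioness_edge(x, y, aggregate=pearson):
    """Per-sample estimates of one edge via leave-one-out (LIONESS formula).

    x, y: lists with the values of two genes across the same N samples.
    """
    n = len(x)
    e_all = aggregate(x, y)  # network built from all N samples
    per_sample = []
    for q in range(n):
        # Network built from all samples except q: one extra fit per sample.
        e_without_q = aggregate(x[:q] + x[q + 1:], y[:q] + y[q + 1:])
        per_sample.append(n * (e_all - e_without_q) + e_without_q)
    return per_sample


def mean_product(x, y):
    """A linear aggregate, used only to sanity-check the formula."""
    return sum(a * b for a, b in zip(x, y)) / len(x)
```

  With a linear aggregate such as `mean_product`, the formula returns exactly each sample's own contribution, which is a convenient correctness check before timing the real method on cohorts of increasing size.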

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.11.05.622163v1
www.biorxiv.org

The enduring advantages of the SLOW5 file format for raw nanopore sequencing data

2
1. GigaScience 30 Oct 2025
  
  in GigaScience
  
  ABSTRACT. Nanopore sequencing is a widespread and important method in genomics science. The raw electrical current signal data from a typical nanopore sequencing experiment is large and complex. This can be stored in two alternative file formats that are presently supported: POD5 is a signal data file format used by default on instruments from Oxford Nanopore Technologies (ONT); SLOW5 is an open-source file format originally developed as an alternative to ONT’s previous file format, which was known as FAST5. The choice of format may have important implications for the cost, speed and simplicity of nanopore signal data analysis, management and storage. To inform this choice, we present a comparative evaluation of POD5 vs SLOW5. We conducted benchmarking experiments assessing file size, analysis performance and usability on a variety of different computer architectures. SLOW5 showed superior performance during sequential and non-sequential (random access) file reading on most systems, manifesting in faster, cheaper basecalling and other analysis, and we could find no instance in which POD5 file reading was significantly faster than SLOW5. We demonstrate that SLOW5 file writing is highly parallelisable, thereby meeting the demands of data acquisition on ONT instruments. Our analysis also identified differences in the complexity and stability of the software libraries for SLOW5 (slow5lib) and POD5 (pod5), including a large discrepancy in the number of underlying software dependencies, which may complicate the pod5 compilation process. In summary, many of the advantages originally conceived for SLOW5 remain relevant today, despite the replacement of FAST5 with POD5 as ONT’s core file format.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf118), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Jan Voges
  
  Comments to Author:

  Synopsis: The manuscript builds on the authors' previous work introducing the SLOW5 format for Oxford Nanopore signal data as an improvement over the FAST5 format. Since then, Oxford Nanopore Technologies (ONT) has introduced its own new format, POD5. This paper directly compares SLOW5 and POD5. The authors claim that SLOW5 provides higher reading speeds for both sequential and random access, writing speeds sufficient to keep pace with data acquisition in sequencing machines, comparable file sizes with no significant storage penalty, and a simpler implementation with fewer dependencies. The paper is clearly written, includes extensive supplementary information, and references the source code for all tools used in the experiments.

  Comments:

  - Sequential access performance: To me it is unclear whether SLOW5's advantage in sequential access originates from its file layout or from the use of mmap I/O versus traditional I/O. A small ablation study, forcing both SLOW5 and POD5 tools to use the same I/O method on platforms with currently large performance differences, would clarify where the performance gain originates.
  - Figure 4: While POD5's dependency structure is indeed more complex than that of slow5lib, the current tree representation exaggerates this complexity. Many common packages (e.g., Python, zlib) appear multiple times as dependencies of multiple other packages. A dependency graph where each package appears only once would be a more informative representation.
  - Figure 5: POD5 versions prior to 0.1.0 appear to be preview releases (and are even marked as such on GitHub). Breaking changes during early previews are normal, so including them in the same visual space as stable versions risks being misleading.
  - Figure 5: Breaking change at version 0.1.12: The timeline indicates a breaking change at POD5 version 0.1.12, which seems particularly relevant as the latest breaking change after version 0.1.0. However, this change is not reflected in the POD5 compatibility matrix on the right. An explanation of what type of breaking change occurred would clarify its impact and help readers assess compatibility risk.
  - Random access "walker strategy": A brief explanation comparing it to SLOW5's index-file approach would improve accessibility without requiring readers to consult external documentation.
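  The ablation suggested for sequential access can be illustrated with a generic sketch that reads the same file once through buffered `read()` calls and once through a memory map, keeping everything else identical. It uses only the Python standard library on an arbitrary binary file; it does not call slow5lib or pod5, and the function names are hypothetical.

```python
import mmap
import time
import zlib


def checksum_with_read(path, chunk_size=1 << 20):
    """CRC32 of a file using conventional buffered read() calls."""
    crc = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc


def checksum_with_mmap(path, chunk_size=1 << 20):
    """CRC32 of the same file through a read-only memory map.

    Note: mmap cannot map an empty file, so the file must be non-empty.
    """
    crc = 0
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for start in range(0, len(mm), chunk_size):
                crc = zlib.crc32(mm[start:start + chunk_size], crc)
    return crc


def timed(fn, path):
    """Return (result, elapsed seconds) for one call of fn(path)."""
    t0 = time.perf_counter()
    result = fn(path)
    return result, time.perf_counter() - t0
```

  Both functions do identical work per byte, so any timing difference reflects the I/O path rather than the file layout. A fair comparison would also need to control the operating system page cache between runs, since a second pass over the same file is usually served from memory.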
2. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Reviewer 1: Wouter De Coster
  
  The authors describe the SLOW5 format and its benefits compared to the standard POD5 format for storing raw sequencing data from nanopore sequencers. The paper is well written and easy to understand. The advantages of the SLOW5 format are clear, and the comparison is adequately executed and described. However, the developers seem unable to persuade others to adopt it widely, and change might need to come from ONT themselves, who may be most concerned about disrupting their existing workflows, especially for parallel writing during sequencing. Nevertheless, the authors seem to have also addressed that issue, as demonstrated with a simulation experiment.
  
  Please find my specific suggestions below.
  
  Sincerely, Wouter De Coster
  
  Major: While I understand that the name SLOW5 began as a variation on the name of the FAST5 format, I don't think that the word 'slow' or the number '5' is a particularly appropriate description, or helpful in making a case for using the file format, as it is neither slow nor related to HDF5. However, once a name is chosen, I understand the reluctance to change it. Additionally, it seems the evaluations are conducted using the binary BLOW5 format. Wouldn't it then make more sense to emphasize BLOW5 in the text and title?
  
  Minor: I would italicize the 'make' tool for users unfamiliar with build tools in the Usability section, as it is a rather strange sentence if reading 'make' as a verb, not a tool. Perhaps the same could be applied to other dependencies in that section for consistency. Then again, the primary target audience will probably understand what 'make' means in this context.
  
  There is a typo in the benchmarking procedure section: 'confoudning'.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.06.30.662478v1
www.biorxiv.org

GTestimate: Improving relative gene expression estimation in scRNA-seq using the Good-Turing estimator

2
1. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Abstract. Background: Single-cell RNA-seq suffers from unwanted technical variation between cells, caused by its complex experiments and shallow sequencing depths. Many conventional normalization methods try to remove this variation by calculating the relative gene expression per cell. However, their choice of the Maximum Likelihood estimator is not ideal for this application. Results: We present GTestimate, a new normalization method based on the Good-Turing estimator, which improves upon conventional normalization methods by accounting for unobserved genes. To validate GTestimate we developed a novel cell targeted PCR-amplification approach (cta-seq), which enables ultra-deep sequencing of single cells. Based on this data we show that the Good-Turing estimator improves relative gene expression estimation and cell-cell distance estimation. Finally, we use GTestimate’s compatibility with Seurat workflows to explore three common example data-sets and show how it can improve downstream results. Conclusion: By choosing a more suitable estimator for the relative gene expression per cell, we were able to improve scRNA-seq normalization, with potentially large implications for downstream results. GTestimate is available as an easy-to-use R-package and compatible with a variety of workflows, which should enable widespread adoption.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf084), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Amichai Painsky
  
  This paper introduces a Good-Turing (GT) estimation scheme for relative gene expression estimation and cell-cell distance estimation. The proposed method, GTestimate, claims to improve upon conventional normalization methods by accounting for unobserved genes. The idea behind this contribution is fairly straightforward: since relative gene expression is defined over a large alphabet, a GT estimator is expected to perform better than a naive ML approach. However, I am not convinced that the authors applied it correctly. First, the proposed GT estimator (as it appears in (GT) in the text) assigns a zero estimate to unobserved genes (Cg = 0). This contradicts the entire essence of using a GT estimator. Second, it makes no sense to use this expression for every Cg > 0. In fact, any reasonable GT-based estimator applies GT for relatively small Cg and the ML estimator for large Cg. See [1] for a thorough discussion. The choice of a threshold between "small" and "large" Cg has been the subject of many studies (for example [2], [1]), but it makes no sense to use the above expression for any Cg. Finally, notice that if N_{Cg} > 0 for some g but N_{Cg+1} = 0, the proposed estimator is not defined. There exist several smoothing solutions for such cases (for example [3]), but they need to be properly discussed. To conclude, I am not sure what effect these issues have on the experiments in the paper, which makes it difficult to assess the results.
  
  REFERENCES
  
  [1] A. Painsky, "Convergence guarantees for the Good-Turing estimator," Journal of Machine Learning Research, vol. 23, no. 279, pp. 1-37, 2022.
  [2] E. Drukh and Y. Mansour, "Concentration bounds for unigram language models," Journal of Machine Learning Research, vol. 6, no. 8, 2005.
  [3] W. A. Gale and G. Sampson, "Good-Turing frequency estimation without tears," Journal of Quantitative Linguistics, vol. 2, no. 3, pp. 217-237, 1995.
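  The three points raised in this report (mass reserved for unobserved genes, GT only for small counts, and the undefined case N_{Cg+1} = 0) can be illustrated with a deliberately simplified sketch. It is not GTestimate's implementation and not the smoothing of [3]: it assumes a fixed, arbitrary threshold and simply falls back to the raw count wherever the GT adjustment is undefined or the count is large. All names are hypothetical.

```python
from collections import Counter


def good_turing(counts, threshold=5):
    """Simplified Good-Turing estimate of relative expression in one cell.

    counts:    dict mapping gene -> observed count (zeros are ignored)
    threshold: counts above this use the raw (ML) count; assumed value
    Returns (probabilities of observed genes, total mass left for unobserved genes).
    """
    observed = {g: c for g, c in counts.items() if c > 0}
    total = sum(observed.values())
    if total == 0:
        raise ValueError("no observed counts")

    freq_of_freq = Counter(observed.values())  # N_r: number of genes seen r times
    missing_mass = freq_of_freq.get(1, 0) / total  # N_1 / N, shared by unseen genes

    adjusted = {}
    for gene, r in observed.items():
        n_r = freq_of_freq[r]
        n_next = freq_of_freq.get(r + 1, 0)
        if r <= threshold and n_next > 0:
            adjusted[gene] = (r + 1) * n_next / n_r  # GT adjusted count r*
        else:
            adjusted[gene] = float(r)  # fallback: undefined or large count

    # Rescale so observed genes share the mass not reserved for unseen genes.
    # If every gene is a singleton, missing_mass is 1 and all estimates are 0.
    scale = (1.0 - missing_mass) / sum(adjusted.values())
    probabilities = {g: a * scale for g, a in adjusted.items()}
    return probabilities, missing_mass
```

  The fallback keeps the estimator defined everywhere, at the price of a discontinuity at the threshold; a proper smoothing of the N_r values, as discussed in [3], would avoid that.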
2. GigaScience 30 Oct 2025
  
  in GigaScience
  
  Reviewer 1: Gregory Schwartz
  
  In this manuscript, Fahrenberger et al. propose a new scRNA-seq normalization method to more accurately report UMI counts of individual cells. They specifically use a Good-Turing estimator, compared with a more commonly used Maximum Likelihood estimator, to adjust raw UMI counts. Using their own cta-seq, a cell targeted PCR-amplification strategy, as ground truth, they compare their estimator with a traditional size-corrected estimator. Furthermore, they illustrate downstream changes using their method, including changes to clustering results and spatial transcriptomic readouts. The manuscript was a clear read and presents an interesting alternative solution to an often overlooked, but important, problem. However, there are some aspects of the manuscript that need to be addressed. Some major content missing includes comparisons with more widely-used normalization methods throughout the manuscript, and better ground truth data sets in their downstream analysis. Specific comments are as follows:
  
  l. 34: To my knowledge, most groups do not use a single division by total UMI count as the only normalization. Seurat has NormalizeData, but also heavily promotes scTransform, a completely different method. Many use log transform (as I believe was done here), some use quantile transform, others use regression techniques etc. It was odd to see these standard normalizations missing in comparisons. The authors should use such standard procedures to demonstrate the superiority of GT.
  
  l. 42: Is there a justification for the successor function being applied within the frequency ((cg + 1) / total) instead of outside ((cg / total) + 1) as is expected with the Good-Turing estimation?
  
  Furthermore, there is typically a smoothing function for erratic N_cg values, which I would expect with single-cell data. In the methods there is a brief mention of linear smoothing, but that would imply that the GT equation is misleading and oversimplified. The actual equation should be included in the main text to avoid confusion.
  
  l. 58: Compared to an average of 16,965 reads per cell, what is the equivalent for the ultra-deep sequencing (not 23 million reads, as that is not a 7.4-fold increase)?
  
  I am not entirely convinced of the use of cta-seq as a ground truth for the cells, especially in comparison with ML. The authors should show that cta-seq has similar UMI and gene count distributions to more popular scRNA-seq technologies (e.g. 10x Chromium) or the application may be specific to cta-seq only.
  
  l. 110: Instead of using unknown classification data sets, there are existing cell-sorted data sets with ground truths (many even on the 10x website). The authors should use these data sets to compare downstream analysis.
  
  l. 125: The spatial transcriptomic results were very subjective, with no statistical hypotheses. The entire manuscript is missing any sort of statistics when comparing methods, which is a major flaw and should be rectified. Here specifically, the color scale stops at 3, but does this carry over to the relative differential expression? The claim is that it is constant, but if they are all greater than 3 then they must be quite variable, so it is surprising to see such a constant value of 0. Maybe the complete color scale should be shown on all figures to clarify this.
  
  From my understanding of the manuscript, the 18 cells for analysis and comparison were chosen based on a typical Seurat analysis. This technique introduces a range of biases into the comparison and makes the argument a bit circular.
  
  For a bias example, the top 2000 most variable genes were used, suggesting that entire classes of genes may be ignored even when highly or lowly expressed, such as housekeeping genes.
  
  There also appear to be many steps that were not entirely justified beyond being part of a "typical analysis", for example excluding a cluster from the analysis (just because it was not that large?), selecting only 18 cells (why 6 from each cluster?), and removing cells with fewer than 1000 expressed genes or over 8% mitochondrial reads (this may be an issue, as it may remove specific cell types or proliferating cells; it should be a bivariate removal with justification). All of these filtering steps reduce the generalizability of GT.
  
  Supplementary Figure references in the text hyperlink to the main figures, which is confusing. More importantly, the captions of the Supplementary Figures read "Figure" rather than "Supplementary Figure".

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.07.02.601501v2
pmc.ncbi.nlm.nih.gov

ColoPola: A polarimetric imaging dataset for colorectal cancer detection

1
1. GigaScience 18 Oct 2025
  
  in Gigascience Annotations
  
  Availability of Supporting Source Code and Requirements
  
  DOME annotations are also available in the DOME registry here https://registry.dome-ml.org/review/futlrtl5w4

Annotators

GigaScience

URL

pmc.ncbi.nlm.nih.gov/articles/PMC12530094/
Sep 2025
www.biorxiv.org

A high-quality reference genome for the Ural Owl (Strix uralensis) enables investigations of cell cultures as a genomic resource for endangered species

2
1. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Abstract. Background: Reference genomes have a wide range of applications. Yet, we are far from a complete genomic picture for the tree of life. We here contribute another piece to the puzzle by providing a high-quality reference genome for the Ural Owl (Strix uralensis), a species of conservation concern and efforts affected by habitat destruction and climate change. Results: We generated a reference genome assembly for the Ural Owl based on high-fidelity (HiFi) long reads and chromosome conformation capture (Hi-C) data. It figures amongst the best avian genome assemblies currently available (BUSCO completeness of 99.94 %). The primary assembly had a size of 1.38 Gb with a scaffold N50 of 90.1 Mb, while the alternative assembly had a size of 1.3 Gb and a scaffold N50 of 17.0 Mb. We show an exceptionally high repeat content (21.07 %) that is different from those of other bird taxa with repeat extensions. We confirm a Strix characteristic chromosomal fusion and support the observation that bird microchromosomes have a higher density of genes, associated with a reduction in gene length due to shorter introns. An analysis of gene content provides evidence of changes in the keratin gene repertoire as well as modifications of metabolism genes of owls. This opens an avenue of research if this is related to flight adaptations. The population size history of the Ural Owl decreased over long periods of time with increases during the Eemian interglacial and stable size during the last glacial period. Ever since it is declining to its currently lowest effective population size. We also investigated cell culture of progressive passages as a tool for genetic resources. Karyotyping of passages confirmed no large variants, while a SNP analysis revealed a low presence of short variants across cell passages. Conclusions: The established reference genome is a valuable resource for ongoing conservation efforts, but also for (avian) comparative genomics research. Further research is needed to determine whether cell culture passages can be safely used in genomic research.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf106), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Jianbo Jian
  
  The authors provide a high-quality reference genome for the Ural Owl (Strix uralensis); these genomic resources are valuable for conservation and evolution. The manuscript is well written, and the scientific story with cell culture for conservation is interesting. I have some questions and comments, as follows:

  1. In the abstract, is the N50 the contig N50 or the scaffold N50?
  2. For the GenomeScope analysis, the estimated genome size is 1.29 Gb with low heterozygosity (0.2%). The assembled genome size is 1.38 Gb. Could there be duplicated genome sequences in the assembly, or did the genome survey evaluation exclude some k-mers? What were the parameters used in GenomeScope2 (e.g., was the -h parameter set to its default value)?
  3. How did you perform the decontamination?
  4. For the Hi-C contact map, because some chromosomes are considerably larger while others are much smaller, I suggest displaying the larger chromosomes separately from the smaller ones to enhance clarity and interpretation.
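  The distinction behind the first question can be shown with a small sketch: N50 is the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly size. Splitting scaffolds at runs of N gives contigs, so the contig N50 can only be equal to or smaller than the scaffold N50. The toy sequences and function names below are hypothetical.

```python
import re


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequences given")


def contigs_of(scaffold):
    """Split one scaffold into contigs at runs of N (gap characters)."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


# Toy assembly: one gapped scaffold and one ungapped scaffold.
scaffolds = ["ACGT" * 5 + "N" * 10 + "ACGT" * 2, "ACGT" * 3]
scaffold_n50 = n50([len(s) for s in scaffolds])
contig_n50 = n50([len(c) for s in scaffolds for c in contigs_of(s)])
```

  Note that the scaffold lengths here include the gap characters, which is one reason the two statistics should always be labelled explicitly.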
2. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Reviewer 2: Luohao Xu
  
  This manuscript provides a high-quality genome of the Ural owl, which is of evolutionary and ecological importance, as well as cell cultures that are worth exploring for endangered species. But the Ural owl does not seem to be an endangered species?
  
  One chromosomal fusion was identified, but it is very important to specify which chromosome. The chromosomes are very conserved in birds. The authors should follow the chromosome nomenclature according to chicken chromosome homology (http://pnas.org/doi/10.1073/pnas.2216641120).
  
  "bird microchromosomes have a higher density of genes" has already been known for 20 years, so there is no need to confirm it again.
  
  It is very speculative to link keratin gene expansions to flight adaptations. I suggest revising this statement throughout the manuscript.
  
  The first paragraph lacks any citations. And the statements are not fully accurate because there are already reference genomes in Strigiformes (owls), some of which were generated by the bioEarch project.
  
  L120, I don't think this is true?
  
  L131, remove million?
  
  L158, again, the authors need to make sure that those chromosomes are homologous to chicken chromosomes. It is known that the 10 smallest microchromosomes are difficult to assemble due to HiFi sequencing dropout (Huang 2023 PNAS). I am curious whether the 10 smallest microchromosomes (or dot chromosomes) have been correctly assembled? Figure 3 does not seem to show this information.
  
  For the 17 lost genes, are they lost in all reference genomes, or just "supported by more than one reference genome" (L260)?
  
  It is not surprising to me that keratin, immune and olfactory receptor genes are independently expanded in different bird lineages.
  
  L284-285, this statement is not true, because females also have a Z chromosome. Maybe the sequence coverage of the Z chromosome can be used to confirm the sex.
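  The coverage check suggested in this comment could be sketched as follows. This is a minimal illustration, not the authors' method: the chromosome names, the made-up depth values, and the 0.75 cutoff are all assumptions. It relies only on the fact that male birds are ZZ and females are ZW, so the Z chromosome is expected at roughly autosomal depth in males and roughly half that in females.

```python
from statistics import median


def z_to_autosome_ratio(depth_by_chrom, z_name="Z", w_name="W"):
    """Median Z depth divided by the median of the per-autosome median depths."""
    autosome_medians = [
        median(depths)
        for name, depths in depth_by_chrom.items()
        if name not in (z_name, w_name)
    ]
    return median(depth_by_chrom[z_name]) / median(autosome_medians)


def infer_sex(ratio, cutoff=0.75):
    """Classify by Z:autosome ratio; the cutoff is an arbitrary midpoint (assumption)."""
    # Ratio near 1.0 suggests ZZ (male), ratio near 0.5 suggests ZW (female).
    return "male (ZZ)" if ratio >= cutoff else "female (ZW)"


# Hypothetical per-window read depths, invented for illustration only.
example_depths = {
    "chr1": [30, 31, 29],
    "chr2": [30, 30, 32],
    "Z": [15, 16, 14],
}
ratio = z_to_autosome_ratio(example_depths)
sex = infer_sex(ratio)
```

  In practice the per-window depths would come from an alignment of the sequencing reads to the assembly, and repeat-rich or pseudoautosomal regions would need to be excluded before taking medians.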
  
  L361, cite B10K publications.
  
  L370, "identified" should be "confirmed"?
  
  L378, this is a bit misleading, because it is clear that barn owls have microchromosomes.
  
  L382, "mainly composed of centromeric satellite DNA", and L387-388 are not true. LINEs and LTRs should still be the major repeat content.
  
  L395-396, "In birds, microchromosomes possibly originate from chromosome fission.", this is not true, again see Huang 2023 PNAS.
  
  The paragraph starting from L394 is already well known. No need to discuss this. Overall, the discussion part needs to be streamlined, including the paragraphs at L434 and L455.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.01.22.633903v4
www.biorxiv.org www.biorxiv.org

WaveSeekerNet: Accurate Prediction of Influenza A Virus Subtypes and Host Source Using Attention-Based Deep Learning

3
1. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Abstract. Background: Influenza A virus (IAV) poses a significant threat to animal health globally, with its ability to overcome species barriers and cause pandemics. Rapid and accurate prediction of IAV subtypes and host source is crucial for effective surveillance and pandemic preparedness. Deep learning has emerged as a powerful tool for analyzing viral genomic sequences, offering new ways to uncover hidden patterns associated with viral characteristics and host adaptation. Findings: We introduce WaveSeekerNet, a novel deep learning model for accurate and rapid prediction of IAV subtypes and host source. The model leverages attention-based mechanisms and efficient token mixing schemes, including the Fourier Transform and the Wavelet Transform, to capture intricate patterns within viral RNA and protein sequences. Extensive experiments on diverse datasets demonstrate WaveSeekerNet’s superior performance to existing models that use the traditional self-attention mechanism. Notably, WaveSeekerNet rivals VADR (Viral Annotation DefineR) in subtype prediction using the high-quality RNA sequences, achieving the maximum score of 1.0 on metrics including the Balanced Accuracy, F1-score (Macro Average), and Matthews Correlation Coefficient (MCC). Our approach to subtype and host source prediction also exceeds the pre-trained ESM-2 (Evolutionary Scale Modeling) models with respect to generalization performance and computational cost. Furthermore, WaveSeekerNet exhibits remarkable accuracy in distinguishing between human, avian, and other mammalian hosts. The ability of WaveSeekerNet to flag potential cross-species transmission events underscores its significant value for real-time surveillance and proactive pandemic preparedness efforts. Conclusions: WaveSeekerNet’s superior performance, efficiency, and ability to flag potential cross-species transmission events highlight its potential for real-time surveillance and pandemic preparedness. This model represents a significant advancement in applying deep learning for IAV classification and holds promise for future epidemiological and veterinary studies and public health interventions.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf089), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3: Weihua Li
  
  (1) In the abstract, the statement 'WaveSeekerNet achieves scores of up to the maximum 1.0 across all evaluation metrics, including F1-score (Macro Average)' appears to slightly deviate from the actual experimental results. (2) In data preprocessing, the reasoning behind selecting and keeping the earliest collected sequence when duplicate sequences are encountered should be explained. (3) Compared with Figure 4, Figure 5 demonstrates performance improvements in most cases, but why is this not observed for some results in Figure 4d? (4) Could the oversampling/undersampling methods employed in the study introduce any potential biases to the analysis? (5) Given that VADR can provide viral classification and annotation information—which serves as the benchmark in this study, what specific advantages does WaveSeekerNet offer for subtype classification? (6) The paper employs 10-fold cross-validation to assess generalizability, yet the data processing section describes a temporal split (pre-2020 for training). Could the "Model Training and Testing" section provide further clarification on this?
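  One possible way the temporal split and the 10-fold cross-validation raised in point (6) could coexist is sketched below. This is only an assumed reading, not something the manuscript is confirmed to do: sequences collected before 2020 form the training pool, cross-validation folds are drawn inside that pool, and later sequences stay untouched as the test set. The record layout, the function names, and the toy years are hypothetical, and 5 folds are used only because the toy pool is small.

```python
def temporal_split(records, cutoff_year=2020):
    """Split records into (train, test) by collection year."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, val_idx) index lists for k contiguous, unshuffled folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    base, extra = divmod(n_items, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional item.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_items))
        yield train_idx, val_idx
        start += size


# Toy records, invented for illustration only.
years = [1998, 2005, 2012, 2019, 2019, 2020, 2023]
records = [{"id": f"seq{i}", "year": y} for i, y in enumerate(years)]

train, test = temporal_split(records)
folds = list(k_fold_indices(len(train), k=5))
```

  A real pipeline would normally shuffle or stratify the folds by class; the contiguous folds here only keep the sketch short.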
2. GigaScience 30 Sep 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf089), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Slim Fourati
  
  Nguyen HH and collaborators trained an ensemble-like deep learning model on HA and NA sequences extracted from GISAID (sequences collected from 1902 to 2019) to predict 1/influenza subtype and 2/host source. Their model was tested on HA and NA sequences collected from 2020 to 2025 and showed improved accuracies compared to other deep learning models. The article is of good quality, with well-documented methods and with proper use of a test set that would mimic real case use of the model (the model would be used on future sequences) and the use of a standard metric to assess the accuracy of the model (F1-score, Bal. Acc, MCC). The figures and tables support the conclusions of the article.
  
  I only have two minor edits that I would suggest to the authors: 1. In the first paragraph of the introduction, the authors explain why predicting host sources is important (for active surveillance and our preparedness for future pandemics). Can the authors explain why predicting influenza subtype is also crucial? 2. lines 573-575. The authors argue that their model is better suited to predict rare variants than previous models like MC-NN. Do the authors think this is only the result of the upsampling of those sequences?
3. GigaScience 30 Sep 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf089), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Will Dampier
  
  The manuscript presented by Nguyen et al. is well written, well researched, and well executed. The use of this new "wavelet style" neural network shows both an increased training efficiency and improved accuracy at detecting influenza subtypes for surveillance. However, I think their comparison to a 'plain' Transformer model does not take advantage of the improvements in pre-training and transfer-learning that have become standard practice in deep-learning. I have also included some stylistic suggestions to improve the figures as presented. After addressing these comments, I believe that this will become a very strong manuscript.
  
  Major Comments:
  
  The authors present a comparison between their new wavelet architecture and a standard transformer architecture using a one-hot encoded vector of amino-acids. I believe that this is the correct 'null model' to compare your wavelet architecture to; however, it does not represent the 'state of the art' in utilizing transformers for sequence analysis. As I'm sure the authors are aware, the disadvantage of transformers is that they take an extensive amount of training (they note the transformer-only models take 2-4X more training epochs to converge). However, the advantage they bring is that they can be extensively trained for one task and then transfer that learning to another related task. A number of models have been pre-trained on giant collections of proteins (Asgari et al., https://doi.org/10.1371/journal.pone.0141287; Rives et al., https://doi.org/10.1073/pnas.2016239118), which then allow one to transfer that knowledge to different domains with fewer examples, as demonstrated in Dampier et al. (https://doi.org/10.3389/fviro.2022.880618). It would be interesting to see whether your wavelet model defeats these pre-trained models with transfer learning. If you showed that, you could argue that there is no need for the extensive expense of 'foundational models'.
  
  The authors discuss that there is a significant imbalance in the training set and they used up-sampling and limiting to balance out the class representation. Since the classes are not equally represented, the model may not be equally able to predict each class. And the high metrics may only be a representation of its ability to predict the popular classes correctly. The authors should include an additional set of figures (supplemental is fine) that show the metrics broken out by Subtype. It would also be interesting to see a graph of the class-size (before up-sampling) vs F1-score (or another metric) on that class. This could provide lower-bounds for how many samples are needed to train the model.
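  The per-subtype breakdown requested above could be computed along these lines. This is a minimal sketch under stated assumptions: the label lists are invented toy data, not results from the manuscript, and the function name is hypothetical. It reports F1 for each class next to that class's size, which is the pairing the comment asks to see plotted.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed one-vs-rest for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy labels, invented for illustration: one rare class (H5N1) is half missed.
y_true = ["H1N1"] * 4 + ["H3N2"] * 4 + ["H5N1"] * 2
y_pred = ["H1N1"] * 4 + ["H3N2"] * 4 + ["H5N1", "H3N2"]

f1_by_class = per_class_f1(y_true, y_pred)
class_sizes = Counter(y_true)
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

for label in f1_by_class:
    print(label, class_sizes[label], round(f1_by_class[label], 3))
```

  In this toy case overall accuracy is 90 %, while the rare class reaches an F1 of only about 0.67, which is the kind of effect an aggregate score can hide.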
  
  Minor Comments:
  
  Figures 3, 4, and 5: These would benefit from a linked y-axis. It is hard to compare across A/B/C/D when the axes have different y-limits.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.02.25.639900v3
www.biorxiv.org www.biorxiv.org

PathoGFAIR: a collection of FAIR and adaptable (meta)genomics workflows for (foodborne) pathogens detection and tracking

2
1. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Abstract. Background: Food contamination by pathogens poses a global health threat, affecting an estimated 600 million people annually. During a foodborne outbreak investigation, microbiological analysis of food vehicles detects responsible pathogens and traces contamination sources. Metagenomic approaches offer a comprehensive view of the genomic composition of microbial communities, facilitating the detection of potential pathogens in samples. Combined with sequencing techniques like Oxford Nanopore sequencing, such metagenomic approaches become faster and easier to apply. A key limitation of these approaches is the lack of accessible, easy-to-use, and openly available pipelines for pathogen identification and tracking from (meta)genomic data. Findings: PathoGFAIR is a collection of Galaxy-based FAIR workflows employing state-of-the-art tools to detect and track pathogens from metagenomic Nanopore sequencing. Although initially developed to detect pathogens in food datasets, the workflows can be applied to other metagenomic Nanopore pathogenic data. PathoGFAIR incorporates visualisations and reports for comprehensive results. We tested PathoGFAIR on 130 samples containing different pathogens from multiple hosts under various experimental conditions. For all but one sample, workflows have successfully detected expected pathogens at least at the species rank. Further taxonomic ranks are detected for samples with sufficiently high Colony-forming unit (CFU) and low Cycle Threshold (Ct) values. Conclusions: PathoGFAIR detects the pathogens at species and subspecies taxonomic ranks in all but one tested sample, regardless of whether the pathogen is isolated or the sample is incubated before sequencing. Importantly, PathoGFAIR is easy to use and can be straightforwardly adapted and extended for other types of analysis and sequencing techniques, making it usable in various pathogen detection scenarios. PathoGFAIR homepage: https://usegalaxy-eu.github.io/PathoGFAIR/
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf017), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Ann-Katrin Llarena
  
  Nasr and colleagues present an, at times, well-written manuscript with an interesting and robust pipeline that includes well-known software (you must make sure to cite the authors of these). However, the manuscript is, quote, "...a collection of Galaxy-based FAIR workflows employing state-of-the-art tools to detect and track pathogens from metagenomic Nanopore sequencing". It is repeated how well it works, and they even compare it to other software in Table 1 (without a proper benchmark). These initial statements are, however, not supported by the findings. For the Salmonella from the spiked samples (present in low quantity, as expected from a food matrix), it is difficult to do more than state that the genus is present, and only a fraction of the samples can actually "complete" the entire pipeline. Also, the benchmarking is not really benchmarking (comparing and measuring this software against other competing software). No such comparison is done, and even though the intention of PathoGFAIR, as stated throughout the paper, is detection and analysis of metagenomic samples, the benchmarking is done on isolate-based WGS. It is also evident that the authors are not microbiologists, as the manuscript is riddled with taxonomical misunderstandings about the vast genus Salmonella and about when to use capital letters and italics. I am also missing a proper discussion of the results found in the spiking experiment in light of current EU legislation on Salmonella. Can this pipeline help in this regard? Sensitivity and specificity metrics are also lacking.
  
  Abstract: "foodborne pathogen data" / "metagenomic Nanopore pathogenic data" - I suggest rewriting, as what I think you are trying to say is "initially developed to detect foodborne pathogens from metagenomic Nanopore data, the workflow can be used to detect any pathogen." "Colony-forming unit and Cycle Threshold values." Rewrite the sentence; I do not completely understand what you are trying to say. What is "sufficient colony forming units"? It will also vary between pathogens (infection dose varies). You could rather state the sensitivity of the pipeline here, even though I think that sample prep, library prep and sequencing influence that more than the bioinformatics. "In any sample": did you test all matrices? "sample is isolated or incubated before seq": you cannot isolate a sample, but you isolate a bacterium from a sample. Imprecise language.
  
  Introduction: In general, it is well written, but a bit imprecise here and there. The authors also rely a lot on the following words: "rapid", "accurate". "outbreaks and epidemics" - rewrite, these are the same. "efforts to mitigate their spread and ensure food safety" - again, complementary terms; rewrite. "global public health authorities" - we have everything from local to global food safety and public health authorities, and I think one should highlight this. There is a difference between, for instance, EFSA and ECDC. "isolation can be complex"? Do you mean complicated or work-intensive? "The utilisation of Nanopore sequencing data, as exemplified in studies like [7]," - citing practices like this are not really reader friendly. I suggest writing what they actually did in [7] (for instance, the detection of blah in blah as shown in [7]). "Once (meta)genomics data has been generated, bioinformatics approaches enable the rapid and accurate detection": repetition of the chapter above. You write in the former chapter about "the utilisation of nanopore data", which of course also includes bioinformatics. SURPI and Sunbeam are freely available? https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-019-0658-x https://chiulab.ucsf.edu/surpi/
  
  "PathoGFAIR: pathogen identification and tracking from metagenomics". I'm not convinced that it can perform tracing in an outbreak where only a few SNPs are allowed. PathoGFAIR does not really speed up the process of sampling, does it? Actually, it takes more time to extract crude DNA from a sample than to place it in an enrichment broth or do a dilution series, so the pre-steps are not really a part of this. "Tracking pathogens" - again, if species level is the lowest rank it can go to, it is not enough to perform tracking.
  
  Overview chapter: "input data is seq data generated w nanopore" - basecalling is not included in the workflow? How is this performed? It affects the quality of the reads, so it is nice to know what you did. The chapter is very wordy and contains a lot of filler words, almost with sales pitches. I would recommend rewriting it; for instance, the chapter that starts with "subsequently" and describes the different workflows and how they work together can be compressed. And the last three sections are sales pitches.
  
  WF1: Preprocessing: How stringent are the filtering and quality control implemented in the workflow? How good a quality do you need for the wp2-4 to work sufficiently well? Did you test? Food vehicle animal? What is that - do you mean that if you extract DNA from bovine meat, you map to the bovine genome? "a tool ten times faster etc etc." is discussion and should be removed from what I think is materials and methods, even though the title of the section is workflow 1. What is a food host? The Kalamari database includes many foodborne pathogens, such as Shigella, E. coli, Campylobacter etc. How can you just remove all reads that match this database? Table 1: Innuendo is based on isolate WGS, and not intended for WGS. Also, it has its own built-in wgMLST schema employed using chewBBACA, so it definitely has allele-based pathogen identification. It is intended for Illumina data. Victors is strictly a platform to analyse virulence factors and is not even intended for taxonomic profiling, and its web interface doesn't work. IDseq has step-by-step guides available on their webpage, so I think that qualifies as a tutorial. You can also contact them (user support). I guess the same is true for OneCodex, as you actually pay for that one. So the table is imprecise at best and should be corrected (I didn't go through the Sunbeam, SURPI or PAIPline specs to try to check if you got them right). Rewrite this. Further, I think you should only include systems / pipelines that are intended for metagenomics. You also have a footnote * that I cannot see in the table.
  
  WF2 taxonomy profiling: The first sentence needs rewriting. The two sentences from "Although Kraken2 is a tool design…….." belong in the discussion. WF3: Medaka consensus pipeline: "This task is performed using neural networks applied from a pileup of individual sequencing reads against a draft assembly." What draft assembly did you use here to create a consensus sequence? Actually, it is not polishing contigs, it is assembling them? Again, there are some descriptions of the software which belong in the discussion, say the perks one gets from using this tool over another. I do not, however, get how screening for virulence genes = pathogen identification. The thing is that in a complex food matrix or faecal samples from animals, things like stx phages will also be present. These are not STEC pathogens unless the phage is inside an E. coli. How do you make sure of the host for mobile genetic elements such as these, on which virulence and AMR genes are often located? Seeing as this is the basis of your pathogen detection?
  
  WF4: Again a bit on choosing one software over another, which is discussion material. WF4/WF5: I am worried about the reliance on SNP-based techniques for Nanopore reads. Is the quality good enough to achieve sufficiently robust results?
  
  Easily adaptable workflows: The last section is repetition (about each WF operating independently).
  
  Use cases: Data generation: Please revise how to write Salmonella names correctly. They should be in italics for genus, species and subspecies names, while the serovar/serotype is non-italic with a capital letter. So the correct term would be: * Salmonella enterica subsp. enterica serovar Houtenae, or in short, Salmonella Houtenae. * The strain DSM554 is of serovar Typhimurium, and this should be referenced like this: Salmonella enterica subsp. enterica serovar Typhimurium strain DSM 554. The first two sentences are contradictory to each other? Sentence starting "15 samples were incubated": don't start a sentence with a number, it looks like 33.15. How much meat did you use? What CFU/g do these Ct values translate to? It is important to know the sensitivity relative to legislation. The limit is zero in 100 grams, but I don't assume you tested 100 g? What does adaptive sampling mean? To exclude chicken DNA? The point v sentence under the description of supplementary table T1 has slightly odd punctuation.
  
  Gene-based pathogen identification: Working with meat to detect low-abundance pathogenic bacteria is challenging without enrichment of the expected pathogen with selective methods. Just incubating it at x temperature might work for some bacteria, but others need a special atmosphere (Campylobacter, clostridia) and nutrients. How do you accommodate this?
  
  Figure 2 B: The grey bars are samples? Why are they collapsed in the left corner? And why are sdhA and mucD highlighted? Also, please put genes in italics. The grey bars on the right (y-axis) are not annotated? To which reference genome is the barplot in d referring? I can see for instance in f that there is a number of SNPs or variants for the Houtenae and Typhimurium, but not Salamae; was the latter used as reference? "an AIDA autotransporter-like protein, only found in Enterica strain samples but not in samples spiked with Houtenae or Salamae strains." All these strains are of the subspecies enterica.
  
  Figure 3: Punctuation is a bit off here and there. Why do you operate with CFU/mL? You added it to meat? It should be CFU/g? It would be nice to have a presentation of the resistance panel of the three spiked strains before presenting the AMR genes. "Similar but inverse relations are observed for CFU/mL value (Figure 3 C & D), with a threshold for VF and AMR gene detection at 10^6." CFU/mL of what? The rinse? Added mL? I don't even know how much meat was included in the DNA extractions. "The further the samples are from these thresholds, the higher the number of VF genes and AMR genes identified. Indeed, the three top scattered dots with identified VF genes between 250 and 300 (Figure 3 A, C, E) are the samples with the highest number of reads, higher CFU/mL value, and a relatively lower Ct value compared to other samples." The tendency is OK, but not for all. For instance, you have several exceptions here for both AMR genes and VF genes. Maybe mark the dots by, say, spiked strain/enrichment or not?
  
  Discussion bit here: "Generally, allowing samples to incubate for a short period before sequencing enhances microbial growth, resulting in higher CFU/mL values and lower Ct values. This increase in microbial concentration improves the efficiency of direct sequencing by providing more genetic material for analysis, facilitating faster and more accurate pathogen detection."
  
  Allele-based pathogen identification: "Salmonella enterica subspecies enterica serovar typhimarium (NC_003197.2)": see the earlier comment on writing Salmonella names taxonomically correctly. "However, given the diversity among Salmonella subspecies in the samples, a high number of complex variants and SNPs were anticipated." You only operate with ONE subspecies of Salmonella - S. enterica subsp. enterica. That's the relevant subspecies, and it contains over 2500 serovariants. I don't understand this process; in an outbreak setting you are dependent on tracing, i.e. showing that your isolates are clonal. PathoGFAIR relies on mapping to a reference genome, but that in turn relies on isolation of the suspected isolate and building a high-quality assembly for the allele-based pathogen identification to work. It is not enough to just show that you have this or that serotype; you will have to show that they are clonal (i.e. separated by a limited number of SNPs, say max 20 SNPs over the full length of the chromosome). This method cannot do this. Samples with prior pathogen isolation: Do I understand correctly that you now extract DNA from isolates? Not the whole sample matrix? If so, how is this benchmarking a pipeline intended for metagenomic sequencing? If you were to extract DNA from feces/food and then use your pipeline, that would be benchmarking. However, this doesn't prove that your pipeline works as you intend it to or claim that it does. How were the samples prepared? If isolates, what were the extraction method and sequencing techniques? The species name is written with a non-capitalized first letter, so Campylobacter jejuni. All gene names should be italicized. Suggest rewriting the sentence "The wet lab procedures performed to isolate and prepare these samples for sequencing adhered to standard microbiological techniques, including cultivation, enrichment, and isolation steps" to reflect the actual sequence: enrichment, cultivation, isolation and verification.
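  The clonality criterion mentioned above (isolates separated by a limited number of SNPs, "say max 20") can be made concrete with a small sketch. It assumes two already aligned, equal-length consensus sequences and uses the reviewer's illustrative figure of 20 as the default threshold; the function names and toy sequences are hypothetical, and real outbreak analyses would use a validated pipeline and a pathogen-specific threshold.

```python
def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ, ignoring non-ACGT sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1
        for base_a, base_b in zip(seq_a.upper(), seq_b.upper())
        if base_a != base_b and base_a in valid and base_b in valid
    )


def possibly_clonal(seq_a, seq_b, max_snps=20):
    """True when the SNP distance does not exceed the chosen threshold (assumed: 20)."""
    return snp_distance(seq_a, seq_b) <= max_snps


# Toy aligned sequences, invented for illustration only.
reference = "A" * 30
close_isolate = "C" * 20 + "A" * 10   # 20 differences
distant_isolate = "C" * 21 + "A" * 9  # 21 differences

close_result = possibly_clonal(reference, close_isolate)
distant_result = possibly_clonal(reference, distant_isolate)
```

  Ambiguous or masked positions (for example N) are skipped rather than counted, so low-coverage regions do not inflate the distance.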
  Conclusion: If it is intended solely for isolates, I think assemblies are a better way to go than this pipeline; it is more reliable for the clonality analysis needed in outbreaks. "We further supported the scientific community by introducing new 46 benchmark samples, making them publicly available. This demonstrates our significant investment of time and resources, providing valuable assets for future research." There are now 82000 C. jejuni just on NCBI, of which 600 are complete. Salmonella genomes are clocking in at 524500 assemblies on EnteroBase. The contribution of these strains is not that they are new samples, but that your isolates represent data from an underrepresented region of the world, namely Palestine.
  
  Supplementary figure S4 is cropped so that the x-axis annotation is not visible. SFigure 5: Midpoint-rooted AMR phylogenetic tree? Supplementary table 1: it is unclear to me whether you added this amount of bacteria or whether it was the result after 1 h or 24 h enrichment. Also, I don't understand how much meat you used for the DNA extraction. The same goes for the Ct values.
2. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Abstract. Background: Food contamination by pathogens poses a global health threat, affecting an estimated 600 million people annually. During a foodborne outbreak investigation, microbiological analysis of food vehicles detects responsible pathogens and traces contamination sources. Metagenomic approaches offer a comprehensive view of the genomic composition of microbial communities, facilitating the detection of potential pathogens in samples. Combined with sequencing techniques like Oxford Nanopore sequencing, such metagenomic approaches become faster and easier to apply. A key limitation of these approaches is the lack of accessible, easy-to-use, and openly available pipelines for pathogen identification and tracking from (meta)genomic data. Findings: PathoGFAIR is a collection of Galaxy-based FAIR workflows employing state-of-the-art tools to detect and track pathogens from metagenomic Nanopore sequencing. Although initially developed to detect pathogens in food datasets, the workflows can be applied to other metagenomic Nanopore pathogenic data. PathoGFAIR incorporates visualisations and reports for comprehensive results. We tested PathoGFAIR on 130 samples containing different pathogens from multiple hosts under various experimental conditions. For all but one sample, workflows have successfully detected expected pathogens at least at the species rank. Further taxonomic ranks are detected for samples with sufficiently high Colony-forming unit (CFU) and low Cycle Threshold (Ct) values. Conclusions: PathoGFAIR detects the pathogens at species and subspecies taxonomic ranks in all but one tested sample, regardless of whether the pathogen is isolated or the sample is incubated before sequencing. Importantly, PathoGFAIR is easy to use and can be straightforwardly adapted and extended for other types of analysis and sequencing techniques, making it usable in various pathogen detection scenarios. PathoGFAIR homepage: https://usegalaxy-eu.github.io/PathoGFAIR/
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf017), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Federico Zambelli
  
  The authors present PathoGFAIR, a set of Galaxy workflows for the metagenomic analysis of shotgun Nanopore sequencing from isolated and non-isolated pathogens in contaminated food samples. They complement their work by analysing and releasing two datasets, one from isolated and the other from non-isolated samples, with the primary objective of illustrating the potential of the workflows. These datasets could also be used as benchmarks for future work.
  
  The manuscript is generally well-written, and the authors highlight the advantages of the proposed workflows in Table 1 by comparing them to similar solutions. The workflows are well integrated into the Galaxy network, are available on the three main usegalaxy instances, and provide a thorough tutorial through the Galaxy training platform. A notable advantage of PathoGFAIR over similar workflows is that, thanks to Galaxy, the final user can easily tailor them by replacing any tool in the workflow with others available in the Galaxy ecosystem. This also allows easy updates for the tools in the workflows.
  
  A few minor points that, if addressed, in my opinion, could further strengthen the manuscript:
  
  1 - The rationale behind the tool selection in each of the four workflows is not always clear. While insights are present for workflows 1 and 4, this is not true for workflows 2 and 3. The reader would benefit from understanding why one tool has been preferred over another for the same task, even more so, given the possibility to modify the workflows easily, when this preference could be the other way around in particular use cases or conditions.
  
  2 - One of the main factors for a successful metagenomic analysis is the correctness, completeness, and up-to-dateness of the reference data. The authors should briefly describe how PathoGFAIR addresses this in Galaxy.
  
  3 - While this workflow is clearly stated to be tailored for shotgun metagenomic sequencing, the authors contrast this approach only with targeted sequencing. Instead, they should also discuss the 16S rRNA metagenomic approach, for which Nanopore kits are available, and why PathoGFAIR has been limited to the analysis of shotgun data.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.06.26.600753v2
www.biorxiv.org www.biorxiv.org

First chromosome-level genome assembly of the colonial tunicate Botryllus schlosseri

3
1. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Botryllus schlosseri (Tunicata) is a colonial chordate that has long been studied for its multiple developmental pathways and regenerative abilities and its genetically determined allorecognition system based on a polymorphic locus that controls chimerism and cell parasitism. We present the first chromosome-level genome assembly from an isogenic colony of B. schlosseri clade A1 using a mix of long and short reads scaffolded using Hi-C. This haploid assembly spans 533 Mb, of which 96% are found in 16 chromosome-scale scaffolds. With a BUSCO completeness of 91.2%, this complete and contiguous B. schlosseri genome assembly provides a valuable genomic resource for the scientific community and lays the foundation for future investigations into the molecular mechanisms underlying coloniality, regeneration, histocompatibility, and the immune system in tunicates.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf097), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3: Cristian Canestro
  
  TO THE AUTHORS
  
  In this MS entitled 'First chromosome-level genome assembly of the colonial chordate model Botryllus schlosseri (Tunicata)', Olivier De Thier and colleagues report the first chromosome-scale assembly of this colonial ascidian species, paying special attention to differences from previously published assemblies and, importantly, between haplotypes. The MS is very well written, and very easy and pleasant to read. It provides data of great quality that are very relevant not only for the ascidian/tunicate community, but also for the field of genome structural evolution. I firmly recommend it for publication, although I think that the authors could discuss it in deeper detail. In particular, I miss a more elaborate discussion of what the results add to our understanding of the similarities and differences between clades that have been published in recent years (I have not been able to find some relevant articles in this regard cited in the bibliography). I also feel that a deeper analysis of the differences between haplotypes could be very interesting, unless they are artifacts of the assemblies. As mentioned below, unless this is part of a longer story for a different MS beyond the scope of this one, I encourage the authors to validate some of the differences they find between haplotypes, to try to correlate the structural variations with differences in gene counts between haplotypes, and to explore whether these differences could be correlated with aspects of biological relevance. I miss, for instance, Venn diagrams of gene content comparing previous assemblies with the haplotypes/haploid genome reported here. In any case, I firmly recommend this MS for publication: most of my suggestions are intended not to question the results of the MS but to improve it, and I understand that some may go beyond the scope of this MS.
  
  Minor points: Introduction Page 1: "the basic body plan of adult tunicates is highly conserved across the entire subphylum [3]". This sentence, which could be OK for ascidians, probably provides a highly simplified view of tunicate adult morphologies, especially when comparing the divergent morphologies of Thaliaceans and Appendicularians. Please elaborate on the sentence.
  
  To understand the comparisons between the data of this MS and previously reported genomes, it seems crucial to understand well the meaning of the "clades and subclades". Please include in the introduction (or where needed) how those clades are defined, what their origins and biological/geographical differences are, … and all the critical information that will especially help non-tunicate readers to understand the results.
  
  Results: The authors refer to the presence of large-scale genomic palindromes in Bs1 and Bs3, but it is unclear what these structures are. Please provide a more detailed explanation of the palindromic nature of these regions.
  
  The data from the haplotype-resolved assemblies are very interesting. I wonder if it is possible to somehow measure the amount of heterozygosity between haplotypes 1 and 2, and between those and the previous versions of the genome, to better understand intra- and inter-subclade variation? The differences in the size of some regions between Colombera and this study, and even between haplotypes 1 and 2, are very interesting. I would find it more informative to merge the three graphs of Figure S9 into one single graph, so we can also easily compare the differences in size between the haplotypes and the haploid assembly. If some of those differences are actually due to deletions, that would deserve further analysis. If this analysis is not part of another ongoing project that will be published somewhere else, I suggest identifying some of those differences with a dot-plot, especially between haplotypes, and validating with long reads crossing those regions whether some of the deletions are real or artifactual. Please include the dot-plot together with the two haplotypes in figure S10. In those cases that could be real, it would be very interesting to know which genes are gone, and whether those genes have been placed somewhere else in the genome as a result of translocations or are actually gone, which could explain some of the differences reported in the gene count between haplotypes.
  
  The authors mention the presence of multiple structural variations, although some of them could be artifacts of misassemblies. Interestingly, the plot of the synteny blocks between the two haplotypes in figure S11 shows some of those structural variations, including cases of:
  - deletions: for instance, there are "blank" regions in Bs1A and Bs3A with no lines, which may reflect areas that are not present in haplotype B.
  - duplications and translocations within chromosomes or between chromosomes of different haplotypes.
  Just looking at this plot, I wonder how the distribution of chromosomes between haplotypes is done. For instance, I see that Bs7B shares a duplicated synteny block with chromosomes Bs10B and Bs14B, but not with Bs10A and Bs10B, which means that the duplications are intra-haplotype, present in B but not in A. But I wonder whether Bs10B and Bs14B could in fact be switched to haplotype A, in which case there would be no duplication or deletion in one of the haplotypes, just a simple translocation. I may be wrong in the interpretation, but I'm curious to understand the graph. In any case, as mentioned above, it would be worthwhile to validate some of those variations with long reads, which could illuminate the biological relevance of the differences between the haplotypes and rule out potential assembly artifacts.
  
  I notice that in figures 7 and S13, some lines are thicker than others. Is this because many "thin" lines overlap and so look like one "thick" line? Otherwise, the visual effect of different thicknesses could be misleading. Please clarify.
  
  In the analysis of the Hox cluster the authors say "[…] our new assembly revealed that B. schlosseri's Hox genes are not scattered. Instead, eight of them were clustered on the second largest scaffold (Bs2), whereas two other ones are found on the 15th largest scaffold (Bs15)." Generally, describing Hox genes as a cluster refers to the fact that they are in close vicinity, with few or no other genes in between them. Therefore, I would not describe eight Hox genes as clustered simply because they are on the same chromosome (maybe even on different arms).
2. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Reviewer 2: Tilman Schell
  
  Review of
  
  First chromosome-level genome assembly of the colonial chordate model Botryllus schlosseri (Tunicata)
  
  from
  
  Olivier De Thier, Marie Lebel, Mohammed M. Tawfeeq, Roland Faure, Philippe Dru, Simon Blanchoud, Alexandre Alié, Federico D. Brown, Jean-François Flot and Stefano Tiozzo
  
  Comments to the authors
  
  De Thier et al. present a high-quality chromosome-scale de novo assembly of the tunicate Botryllus schlosseri from mainly PacBio HiFi and Arima Hi-C reads. Further WGS Illumina and ONT data were used to resolve assembly errors or support the correctness of the assembly structure. Structural and functional annotations are conducted thoroughly. Downstream analyses include a synteny comparison of different Tunicata based on ancestral linkage groups and Hox genes.
  
  The manuscript is well written and methods are mostly described to ensure reproducibility. Despite the good shape of the manuscript, I would like to give some remarks, which should be addressed in a revised manuscript before publication.
  
  General remarks
  
  I like the quote in the beginning of the introduction.
  
  The authors conducted downstream analyses with different related chromosome-level tunicate genome assemblies. For assembly metrics, there is a comparison regarding BUSCO assessment only. I would point out the high quality of the B. schlosseri assembly in Tables 2 and 4 by comparison with the other chromosome-level and annotated tunicate genome assemblies as well.
  
  I am not an expert regarding tunicates, so please excuse my basic, curiosity-driven question: In the results section "The laboratory model Sub-clade A1" you state that a part of COI is used as a barcode to differentiate ascidian species. In the introduction you state that wild colonies are able to fuse, resulting in mixed genotypes. Since sample E was derived from the wild at some point, it might be theoretically possible to have not only mixed nuclear genotypes but mixed mitotypes too. Depending on how old sample E is and how fast fixation of a mitotype can happen within a colony, this might be reflected in your data. Furthermore, this thought could be expanded to nuclear genotypes, which could hamper scientific findings.
  
  Contamination filtering was based on a sequence similarity search and taxonomic assignment of blobtools only. Although blobtools/blobtoolkit was applied, I was not able to find a blobplot in the supplemental files. I would like to encourage the authors to add blobplots before and after contamination filtering, at least to the supplement. In my opinion, blobplots are most powerful when considering GC content and coverage in the first place, especially when dealing with taxa that are underrepresented in public databases. Therefore, using taxonomic assignment only for contamination filtering might generate false positives (e.g. conserved sequences across the tree of life with a taxonomic assignment different from Chordata but with similar GC and coverage to the target) and false negatives (e.g. short sequences of the assembly which couldn't be assigned but have different GC and coverage than the target).
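  The idea of screening by GC content and coverage rather than by taxonomic assignment alone can be sketched as follows. This is a rough illustration under stated assumptions, not how blobtools works internally: the function names, the tolerance `gc_tol`, the fold threshold `cov_fold`, and the use of simple medians as the "target" profile are all hypothetical choices made for the example.

```python
from statistics import median


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def flag_outliers(contigs, coverage, gc_tol=0.10, cov_fold=2.0):
    """Flag contigs whose GC or coverage deviates from the assembly median.

    contigs:  dict of contig name -> sequence
    coverage: dict of contig name -> mean read depth
    The thresholds are arbitrary illustration values (assumptions).
    """
    gc = {name: gc_fraction(seq) for name, seq in contigs.items()}
    ref_gc = median(gc.values())
    ref_cov = median(coverage[name] for name in contigs)
    flagged = []
    for name in contigs:
        gc_off = abs(gc[name] - ref_gc) > gc_tol
        cov_off = (
            coverage[name] > ref_cov * cov_fold
            or coverage[name] < ref_cov / cov_fold
        )
        if gc_off or cov_off:
            flagged.append(name)
    return flagged


# Toy data (hypothetical contigs and depths).
contigs = {
    "c1": "ATATGCAT" * 10,
    "c2": "ATATGCTA" * 10,
    "c3": "GCGCGCGC" * 10,   # GC outlier
    "c4": "ATTAGCAT" * 10,   # coverage outlier
}
coverage = {"c1": 30, "c2": 32, "c3": 31, "c4": 200}
print(flag_outliers(contigs, coverage))
```

  Flagged contigs would be candidates for manual inspection on a blobplot, not automatic removal.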
  
  In the paragraphs "Results and Discussion" (Haplotype-resolved assembly) as well as in "Methods" (Haploid genome assembly) you use the term "haploid assembly" multiple times. I find this term misleading, since the genome is not haploid and the assembly represents both haplotypes at the same time. I assume that primary contigs from hifiasm were used to generate this assembly. Therefore, I would suggest calling this assembly e.g. "based on primary contigs", "non phased", "haplotype mixed" or "haplotype unresolved" (as opposed to "haplotype resolved").
  
  Particular remarks
  
  Results and Discussion
  
  Sequencing and genome size estimation
  
  Table 1 Please specify what "round 1" and "round 2" are referring to. Was one library sequenced twice or were two different libraries created and sequenced?
  
  Haploid genome assembly
  
  "We identified 28 contigs that belong to spore-forming unicellular parasites of the microsporidia group [32]. This represents the first report of this fungal group in a tunicate species." Is this identification based on blobtools taxonomic assignment? This is not described in the methods. Furthermore, can you rule out that the identification or taxonomic assignment is a false positive? If not, you should tone down the second sentence and maybe discuss this.
  
  "We then performed Hi-C scaffolding using YaHS [34], which reduced the number of contigs to 256, before [...]" Technically, scaffolding with YaHS can only increase the number of contigs, because original (hifiasm) contigs are split according to the Hi-C signal (at least as long as the option --no-contig-ec isn't applied). I would substitute "contigs" with "sequences".
  
  "Finally, a manual curation was performed, resulting in an assembly made up of 16 major scaffolds [...]" Is there any previous study on the karyotype of B. schlosseri? If so, citing it here would strengthen your results. Otherwise, I would recommend stating the karyotypes or the number of chromosome-scale scaffolds of other tunicates here and discussing whether your findings are in line with them.
  
  Table 2 Please substitute "No. of scaffolds" with "No. of sequences". Please add the contig N50 values. As pointed out above, I would like to see a comparison to the other chromosome level tunicate genome assemblies here, instead of showing basically the same stats twice.
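  The distinction between "sequences" (scaffolds) and contigs, and the requested contig N50, can be illustrated with a short sketch. It assumes that gaps inside scaffolds are represented as runs of N, which is a common convention but an assumption here; the function names and the toy scaffolds are hypothetical.

```python
import re


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def split_contigs(scaffold, min_gap=1):
    """Split a scaffold into contigs at runs of at least min_gap N characters."""
    return [c for c in re.split(r"[Nn]{%d,}" % min_gap, scaffold) if c]


# Toy scaffolds (hypothetical): one gapped scaffold and one gap-free sequence.
scaffolds = ["A" * 60 + "N" * 10 + "C" * 40, "G" * 30]

sequence_lengths = [len(s) for s in scaffolds]
contig_lengths = [len(c) for s in scaffolds for c in split_contigs(s)]

print("No. of sequences:", len(sequence_lengths), "sequence N50:", n50(sequence_lengths))
print("No. of contigs:", len(contig_lengths), "contig N50:", n50(contig_lengths))
```

  Reporting both values side by side makes clear how much of the contiguity comes from scaffolding and how much from the underlying contigs.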
  
  "[…] highlighted the presence of two large-scale genomic palindromes located within Bs1 and a smaller one in Bs3 (Figure 3)." The figure shows the presence but maybe you can highlight them in the figure and the caption even more?
  
  "To find out whether these palindromes may result from assembly artifacts [40], we checked the localization of the duplicated BUSCO genes along the chromosomes and did another run of CRAQ [...]" You could support your findings by showing an even coverage distribution within the palindromes that is similar to the coverage distribution of the whole assembly. Either a histogram or a zoomed-in version of the read coverage across the reference, as in the outer layer of the circos plot, could show this nicely.
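  A simple numerical complement to such a coverage plot is the ratio between the median depth inside a candidate region and the median depth of the whole sequence. The sketch below assumes per-base depth is already available as a list of integers (for example exported from an alignment); the function name, the coordinates, and the toy depth values are hypothetical.

```python
from statistics import median


def coverage_ratio(depth, start, end):
    """Median depth in depth[start:end] divided by the median depth overall."""
    region = depth[start:end]
    if not region:
        raise ValueError("empty region")
    overall = median(depth)
    if overall == 0:
        raise ValueError("overall median depth is zero")
    return median(region) / overall


# Toy per-base depth (hypothetical): a 20 bp stretch at half the usual depth.
depth = [30] * 100 + [15] * 20 + [30] * 80

print(coverage_ratio(depth, 0, 100))    # region with typical depth
print(coverage_ratio(depth, 100, 120))  # region with reduced depth
```

  A ratio close to 1 would be consistent with the region being covered like the rest of the assembly, whereas a clearly lower ratio could hint at a falsely duplicated region and would warrant a closer look.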
  
  Methods
  
  Sampling, DNA isolation, and sequencing
  
  "HiFi PacBio long reads" Please provide more details on how PacBio libraries (was it actually one library sequenced twice or two different libraries?) were created and sequenced. Were low or ultra-low protocols used? On which machine was sequencing conducted?
  
  RNA-seq data
  
  Is downloading public data a method? In any case you should cite the original papers and provide a list of accession numbers (supplement), but I would remove this paragraph and add the information to the paragraph "Genome annotation", e.g. "Publicly available RNA-seq reads [23, 25, 8] were aligned to the soft-masked assemblies [...]"
  
  Data preprocessing
  
  Depending on how the PacBio libraries were created and which PacBio machine was utilized for sequencing, you should state how HiFi calling was conducted (e.g. Sequel II) and how PCR adapter and duplicates were filtered out (e.g. ultra-low).
  
  Haploid genome assembly
  
  "To this aim, contigs were aligned to the NCBI nucleotide database (accessed 2023 March 18) using BLAST+ [78]" Please state the version of BLAST+.
  
  "Finally, a BLASTN search for fragments of the mitochondrial genome among the contigs was performed using the published complete mitochondrial genome of B. schlosseri (RefSeq NC_021463.1) [28]." Were the fragments filtered out based on the blast search? Please explain what was done in detail. Which hits were considered (e.g. cutoffs)? The mitochondrial genome of E was assembled with NOVOPlasty, which is by the way not stated in the methods but in the results only. Was the assembled mt genome of E added to the assembly, once the fragments were filtered out?
  
  Haplotype-resolved assembly
  
  If I understand correctly, the rapid curation pipeline was applied but no dual curation was conducted. When aiming for haplotype-resolved assemblies, I would recommend applying this method, e.g. concatenating both haplotypes and creating a combined contact map of haplotypes 1 and 2, which can be curated as usual, with the advantage of being able to exchange (parts of) sequences between the haplotypes. In some cases the phasing from hifiasm is not correct and can easily be corrected with this approach.
3. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Reviewer 1: Jerome Hui
  
  In this manuscript, De Thier and colleagues report the chromosome-level genome assembly of the tunicate Botryllus schlosseri (Pallas, 1766) sub-clade A1. The methods used in this study are standard. B. schlosseri has been used as a laboratory model in certain places for decades to understand asexual development and regeneration. Although a draft-quality genome was published a decade ago (eLife 2013, 2:e00569), the authors here produced a high-quality phased genome based on modern technologies. In terms of genomic resources for this laboratory model, this is important and useful. The authors have also carried out analyses of repeats, synteny, and Hox cluster genes. I also think some of these results are interesting. Below are my comments and suggestions for the authors to consider, which hopefully can further improve the manuscript.
  
  Given the authors merged the results and discussion into one section, I would expect more discussion for several parts, including:
  
  a. Repeats - For now, the analysis is quite standard and the main text is relatively descriptive. The question to me is what have we learnt from understanding the repeats in the B. schlosseri genome? The authors should tell the readers.
  
  b. Synteny analyses - This is an interesting finding. Extensive chromosomal rearrangement has also been discovered in other animals in recent years. Can the authors further discuss these events?
  
  c. Hox gene analyses - Again, it is quite descriptive. Tunicates have been well known for their dispersed Hox cluster for decades. So what have we learnt from the situation in B. schlosseri? I would be glad if the authors could discuss this.
  
  Figure S14
  
  The authors should also show the bootstrap values on the key nodes.
  
  In addition, the authors should also use one more method to construct the Hox gene tree, besides the Maximum Likelihood method.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.05.29.594498v1
www.biorxiv.org www.biorxiv.org

Haplotype-resolved reference genomes of the sea turtle clade unveil ultra-syntenic genomes with hotspots of divergence

3
1. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Abstract. Background: Reference genomes for the entire sea turtle clade have the potential to reveal the genetic basis of traits driving the ecological and phenotypic diversity in these ancient and iconic marine species. Furthermore, these genomic resources can support conservation efforts and deepen our understanding of their unique evolution. Results: We present haplotype-resolved, chromosome-level reference genomes and high-quality gene annotations for five sea turtle species. This completes the catalog of reference genomes of the entire sea turtle clade when combined with our previously published reference genomes. Our analysis reveals remarkable genome synteny and collinearity across all species, despite the clade’s origin dating back more than 60 million years. Regions of high interspecific genetic distance and intraspecific genetic diversity are consistently clustered in genomic hotspots, which are enriched with genes coding for immune response proteins, olfactory receptors, zinc fingers, and G-protein-coupled receptors. These hotspot regions may offer insights into the genetic mechanisms driving phenotypic divergence among species, and represent areas of significant adaptive potential. Ancient demographic analysis revealed a synchronous population expansion among sea turtle species during the Pleistocene, with varying magnitudes of demographic change, likely shaped by their diverse ecological adaptations, and biogeographic contexts. Conclusions: Our work provides genomic resources for exploring genetic diversity, evolutionary adaptations, and demographic histories of sea turtles. We outline genomic regions with increased diversity, linked to immune response, sensory evolution, and adaptation to varying environments that have historically been subject to strong diversifying selection, and likely will underpin sea turtle’s responses to future environmental change. These reference genomes can assist conservation by providing insights into the demographic and evolutionary processes that sustain and threaten these iconic species.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf105), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3: Xiaoli Liu
  
  (1) It is recommended to add keywords such as "conservation genomics" or "adaptive evolution" to better align with the content.
  (2) In the background section, after discussing the current status of sea turtles and existing genomic research, the study's content is introduced directly without adequately explaining why it is necessary to sequence the genomes of the remaining five species of sea turtles on top of the existing partial genomic data. The introduction of the research objectives appears somewhat abrupt.
  (3) Last line of page four, "Previous analyses in particular of the……within this ancient clade [34,38]": When introducing the broad context of genomics and biodiversity conservation, it is important to provide detailed explanations for key concepts such as 'genomic synteny' and 'colinearity'. Although these concepts are covered later in the analysis of the turtle genome, providing initial elaboration can help readers better understand subsequent content.
  (4) Page 6, Section 2.2: The range of this quality value, 38.7, is incorrect. Please verify carefully.
  (5) Result 3.1: High conservation at the chromosomal level is supported, but repetitive sequences must be excluded from synteny analysis.
  (6) Section 3.4, second paragraph: The reliability of PSMC in low-diversity species, such as N. depressus, may be limited; it is recommended to validate findings with other methods, such as MSMC2.
  (7) It is recommended to include a detailed description of sample selection in the methods section, covering aspects such as geographic distribution, population size, and sample collection methods, to demonstrate the representativeness and reliability of the selected samples.
2. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Reviewer 2: Brendan Reid
  
  The authors of this work provide a fantastic addition to the genomic resources currently available for marine turtles with five new, apparently high-quality reference genomes. These new resources enable a number of interesting cross-species analyses in this group, including phylogenetic reconstruction, inference of demographic history, and identification of hotspots of diversity and divergence. I though this paper was quite clearly written and easy to read overall, and I have one major and a few more minor comments/suggestions.
  
  Major comment: there is an extensive literature on hybridization among marine turtle lineages (see Vilaca et al. 2021, https://doi.org/10.1111/mec.16113, for a recent genomic example), with lots of evidence for ancient gene flow after initial lineage divergence as well as recent hybridization. The authors do not really mention this phenomenon at all, and since I think it has a lot of bearing on all of the results it would make sense to re-think your findings in light of the fact that some level of gene flow has occurred. Would extensive synteny/lack of genomic rearrangements potentially enable hybridization? Is overall low divergence among lineages potentially a function of gene flow? Are regions of high divergence the result of selection (as you suggest), or could these regions potentially be resistant to gene flow? I believe that IQtree assumes a strictly bifurcating tree, and gene flow can influence PSMC inferences (see Mazet et al. 2016, https://doi.org/10.1038/hdy.2015.104) - how would gene flow among lineages affect your inference of divergence dates and demographic histories?
  
  Minor comments [note - line numbers would have been helpful for providing comments on specific items! I will refer to the lower-left page numbers and paragraph instead]:
  
  page 3, paragraph 2: Some of the applications you refer to here don't seem terribly germane to the relevance of "genomic resources" in management and conservation per se, and several are just methods using some kind of genetic data ... e.g., "abundance"/close-kin mark recapture doesn't require full genomes (and the reference you cite used microsat data), and the "community"/eDNA applications don't generally rely on genomes but instead on databases of a few (usually mitochondrial) genes. Either include methods that truly benefit from the development of high-quality reference genomes or broaden this to something like "growth in molecular ecology techniques".
  
  page 4, paragraph 2: last sentence is a bit of a run-on, could break this up a bit.
  
  page 10, paragraph 3: for me, the ROH methods need some additional explanation and interpretation. The more detailed methods indicate that the ROH were identified on the basis of lower-than-average heterozygosity rather than true homozygosity - I can understand why this might have been done (since the baseline level of heterozygosity varies across species) but it still seems a bit arbitrary and could risk mistaking stretches with simply low variation for IBD tracts. I wonder if a ROH-detection method like ROHan that explicitly incorporates baseline genomic heterozygosity into its model would be more appropriate for comparing results across species and could give different results. I also question a bit the interpretation of these low-diversity tracts as evidence of inbreeding per se. The authors do not comment much on the length distributions of these ROH - given that many of them are quite short I would expect that if there was mating between close kin it probably happened far back in the past and the IBD tracts have been broken up by recombination.
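
The reviewer's point about ROH length can be made concrete with a standard back-of-the-envelope relation: tracts inherited identical by descent from a common ancestor g generations back have an expected length of about 100/(2g) cM. The sketch below assumes a uniform recombination map and a single shared ancestor; the function names are illustrative and do not come from the manuscript or from any ROH-calling tool.

```python
def expected_ibd_length_cm(generations):
    """Expected IBD tract length in centimorgans.

    Assumes tract lengths are exponentially distributed with rate 2g per
    Morgan, where g is the number of generations back to the common ancestor.
    """
    if generations <= 0:
        raise ValueError("generations must be positive")
    return 100.0 / (2 * generations)


def generations_from_length_cm(length_cm):
    """Invert the relation: rough age (in generations) implied by a mean tract length."""
    if length_cm <= 0:
        raise ValueError("length_cm must be positive")
    return 100.0 / (2 * length_cm)


# Recent inbreeding leaves long tracts; old inbreeding leaves short ones.
for g in (2, 5, 50, 500):
    print(g, expected_ibd_length_cm(g))
```

Under these assumptions, short tracts point to common ancestors many generations back, which is the interpretation the reviewer suggests for the short ROH reported here.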
  
  page 11, paragraph 2: for PSMC analyses it is important to note that the method assumes that differences in coalescence time/Ne across the genome result from demography alone. If portions of the genome are under balancing/diversifying selection (such as the areas of high diversity that you detect in this study), the local Ne inferred for these regions would be expected to be larger than for the rest of the genome, which could lead to the spurious detection of population expansion or contraction (more likely a contraction for balancing selection). See Boitard et al. 2022 (https://doi.org/10.1093/genetics/iyac008) for a more detailed treatment. I would try excluding the regions putatively under diversifying selection and re-running PSMC to see if your inferences change.
3. GigaScience 30 Sep 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf105), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Laura Caquelin
  
  Summary of the Study
  The authors aimed to create high-quality reference genomes for five sea turtle species to better understand their genetic diversity, evolutionary adaptations, and ecological traits. They used haplotype-resolved, chromosome-level reference genomes and gene annotations to reveal conserved genome structures, genetic hotspots linked to immune response and sensory evolution, and patterns of demographic expansion. Their findings highlight areas of genetic diversity critical for adaptation and conservation efforts.
  
  Scope of reproducibility
  
  According to our assessment the primary objective is: Investigation of multi-copy gene family enrichment in genomic hotspots of sea turtles.
  
  Outcome: Significant enrichment of "MHC", "Immunology-related", "G-Protein Coupled Receptor" (GPCR), "Olfactory Receptor" or "Zinc-Finger" in genomic hotspots with high genetic divergence, diversity, and gene density.
  
  Analysis method outcome: Fisher's exact test followed by Benjamini-Hochberg correction
  
  Main result: "Following functional annotation of the genes found in these hotspots, we found enrichment for multi-copy gene families coding for proteins with functions in immune response, olfactory receptors (ORs), zinc fingers, and G-protein-coupled receptors (GPCRs) (Fig 4c, Tables S6 & S7). This included enrichment of immunology-related genes, GPCRs, ORs, and Zinc-finger genes in chromosome 13 (adjusted p < 10^-42, 10^-47, 10^-79, 0.01, respectively), MHC genes, Immunology-related genes, GPCRs, ORs, and Zinc-finger genes in chromosome 14 (adjusted p < 10^-24, 10^-6, 10^-2, 10^-10, 10^-52, respectively) and Immunology-related genes and GPCRs in chromosome 24 (adjusted p < 10^-3 and 10^-3, respectively)." (page 10).
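
To make the analysis method named above concrete, here is a minimal sketch of a one-sided Fisher's exact test for enrichment followed by Benjamini-Hochberg adjustment. It is written in plain Python for illustration, whereas the authors' script is in R; the function names and the gene counts in the usage lines are hypothetical and are not taken from the manuscript or its repository.

```python
from math import comb


def fisher_enrichment_p(hits_in, size_in, hits_out, size_out):
    """One-sided Fisher's exact test for enrichment.

    Returns P(X >= hits_in) under the hypergeometric null, where `size_in`
    genes lie in the hotspot and `size_out` genes lie elsewhere.
    """
    total = size_in + size_out
    total_hits = hits_in + hits_out
    upper = min(size_in, total_hits)
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, size_in - k)
        for k in range(hits_in, upper + 1)
    )
    return tail / comb(total, size_in)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: (family genes in hotspot, hotspot genes,
#                       family genes elsewhere, genes elsewhere)
tests = {
    "family_A": (30, 200, 100, 18000),
    "family_B": (3, 200, 250, 18000),
}
raw = [fisher_enrichment_p(*counts) for counts in tests.values()]
for name, p_adj in zip(tests, benjamini_hochberg(raw)):
    print(name, p_adj)
```

The adjustment is applied across all tests at once, so the order in which p-values are passed in and read back matters when rebuilding a table from the adjusted vector.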
  
  Availability of Materials

  a. Data
  
  Data availability: Open
  
  Data completeness: Complete
  
  Access Method: Repository
  
  Repository: https://git.imp.fu-berlin.de/begendiv/sea_turtlegenomes
  
  Data quality: The data files have been shared and appear sufficient for running the analyses. However, no metadata is provided to describe the content, structure, or origin of the files, which limits interpretability and reusability.

  b. Code
  
  Code availability: Open
  
  Programming Language(s): R (for the enrichment test)
  
  Repository link: https://git.imp.fu-berlin.de/begendiv/sea_turtlegenomes
  
  License: MIT license
  
  Repository status: Public
  
  Documentation: Short README that describes only the layout of the directory.
  
  Computational environment of reproduction analysis
  
  Operating system for reproduction: MacOS 14.7.4
  
  Programming Language(s): R
  
  Code implementation approach: Using shared code
  
  Version environment for reproduction: R version 4.4.1/RStudio 2024.09.0
  
  Results
  
  5.1 Original study results
  
  Results 1: The main results are presented in Figure 4 and the numerical p-values are available on supplementary table 6 and table 7.
  
  5.3 Steps for reproduction
  -> Run the code "enrichment_test.R" shared on Git
  - Issue 1: Files needed to run the code are not shared in the Git repository: "GCF_009764565.3_rDerCor1.pri.v4_genomic.longest.aa.tsv", "hotspots_chr13.longest.aa.tsv", "hotspots_chr14.longest.aa.tsv", "hotspots_chr24.longest.aa.tsv".
  -- Resolved: These analysis data were not shared on the internal GigaScience FTP server or in the Git repository. On request, the authors uploaded all the files to the Git repository.
  
  5.4 Statistical comparison Original vs Reproduced results
  - Results: Tables S6 and S7 were reproduced:
  -- Supplementary table S6: see screenshot from R console
  -- Supplementary table S7: see screenshot from R console
  
  Comments: The original R code "enrichment_test.R" simply stored the p-values results in a value object. To simplify the comparison process, directly obtain the final table, and ensure reproducibility while minimizing errors, we implemented the creation of the table.
  
  ------------------ Start of R code ------------------
  # Creating final tables

  # Corresponding to supplementary table S6
  table_S6 <- data.frame(
    enrichment = c("MHC", "Immunology", "GPCR", "Olfactory", "Zinc-finger"),
    Chr13 = c(p_mhc13, p_immune13, p_gpcr13, p_or13, p_zinc13),
    Chr14 = c(p_mhc14, p_immune14, p_gpcr14, p_or14, p_zinc14),
    Chr24 = c(p_mhc24, p_immune24, p_gpcr24, p_or24, p_zinc24))

  # Corresponding to supplementary table S7
  # Create a vector of names for rows and columns
  # (! warning: the p-values in fdrs are not in the same order as in table S7)
  enrichment <- c("MHC", "Olfactory", "GPCR", "Immunology", "Zinc-finger")
  chromosomes <- c("Chr13", "Chr14", "Chr24")

  # Reorganizing fdrs in a matrix
  table_S7 <- matrix(fdrs, nrow = length(enrichment), byrow = TRUE)
  rownames(table_S7) <- enrichment
  colnames(table_S7) <- chromosomes

  # Organizing rows as in the original table S7
  library(dplyr)
  table_S7 <- as.data.frame(table_S7) # Convert matrix to data frame
  table_S7 <- table_S7 %>%
    slice(match(c("MHC", "Immunology", "GPCR", "Olfactory", "Zinc-finger"), enrichment))
  ------------------- End of R code -------------------
  
  Errors detected: The statement "MHC genes, Immunology-related genes, GPCRs, ORs, and Zinc-finger genes in chromosome 14 (adjusted p < 10^-24, 10^-6, 10^-2, 10^-10, 10^-52, respectively)" (page 10) appears to contain an error. Specifically, the p-value for Olfactory Receptors (5.583367e-10) is greater than 10^-10, so the reported bound "adjusted p < 10^-10" does not hold. The threshold for Olfactory Receptors should be revised to 10^-9.
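
The check behind this correction is simple arithmetic and can be sketched as follows. The helper name is illustrative; the only value taken from the review is the reproduced p-value 5.583367e-10.

```python
import math


def tightest_power_of_ten_exponent(p):
    """Smallest integer e such that p < 10**e (strict inequality)."""
    if p <= 0:
        raise ValueError("p must be positive")
    return math.floor(math.log10(p)) + 1


p_or_chr14 = 5.583367e-10  # reproduced adjusted p-value for ORs on chromosome 14

exponent = tightest_power_of_ten_exponent(p_or_chr14)
print(f"p < 10^{exponent}")          # tightest valid bound
print(p_or_chr14 < 10 ** -10)        # the bound stated in the manuscript
```

Reporting bounds this way guarantees that every stated threshold is actually satisfied by the underlying value.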
  
  Statistical Consistency: The p-values are consistent (see screenshot from R console).
  
  Conclusion
  
  Summary of the computational reproducibility review
  The inferential statistics for the objective "Investigation of multi-copy gene family enrichment in genomic hotspots of sea turtles" were successfully reproduced using the original analysis code provided by the authors. The input data needed to run the code were initially unavailable but were subsequently shared through the Git repository. An inconsistency was noted in the text of the manuscript reporting a threshold for Olfactory Receptors, where the stated 10^-10 should be revised to 10^-9 based on the observed p-value (5.583367e-10).
  
  Recommendations for authors
  While the original analysis code was successfully used to reproduce the results, we recommend improving the documentation to enhance clarity and reproducibility. In particular:
  -- Code annotation: The scripts would benefit from more detailed comments within the code to clarify the logic of each step. This would greatly help users follow the analyses more easily and understand the purpose of specific commands or operations.
  -- README file: The current README provides only a general overview. We suggest expanding it to include:
  --- A brief description of each script or analysis pipeline.
  --- An indication of which figure, table, or result in the manuscript each script corresponds to.
  --- Clear instructions on how to execute the analyses in the correct order, if applicable.
  -- Metadata: For the datasets used or generated by the scripts, it would be helpful to include accompanying metadata files that explain:
  --- The definition of each variable name.
  --- The origin of each dataset (raw, processed, etc).
  --- Any preprocessing steps applied before analysis.
  -- Data availability: At this stage, we have only verified the reproducibility of one part of the study. To facilitate full reproducibility of the entire study, we recommend sharing all necessary data files required to run every script present in the repository.
  
  These improvements would make the repository significantly more user-friendly and would strengthen the reproducibility of the study.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.03.26.644878v1
www.biorxiv.org

CNSistent integration and feature extraction from somatic copy number profiles

3
1. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Abstract
  The vast majority of cancers exhibit Somatic Copy Number Alterations (SCNAs): gains and losses of variable regions of DNA. SCNAs can shape the phenotype of cancer cells, e.g. by increasing their proliferation rates, removing tumor suppressor genes, or immortalizing cells. While many SCNAs are unique to a patient, certain recurring patterns emerge as a result of shared selectional constraints or common mutational processes. To discover such patterns in a robust way, the size of the dataset is essential, which necessitates combining SCNA profiles from different cohorts, a non-trivial task. To achieve this, we developed CNSistent, a Python package for imputation, filtering, consistent segmentation, feature extraction, and visualization of cancer copy number profiles from heterogeneous datasets. We demonstrate the utility of CNSistent by applying it to the publicly available TCGA, PCAWG, and TRACERx cohorts. We compare different segmentation and aggregation strategies on cancer type and subtype classification tasks using deep convolutional neural networks. We demonstrate an increase in accuracy over training on individual cohorts and efficient transfer learning between cohorts. Using integrated gradients we investigate lung cancer classification results, highlighting SOX2 amplifications as the dominant copy number alteration in lung squamous cell carcinoma.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf104), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3: Sampsa Hautaniemi
  
  Streck and Schwarz present a method, CNSistent, for consistent segmentation of copy-number data. The utility of the tool is demonstrated using three large cancer cohorts and a neural network classifier built upon the consistently segmented data. CNSistent can facilitate solving an important biomedical problem: the advanced analysis of copy-number data. The authors are lauded for their excellent Python code and thorough documentation. While the contribution is timely and likely important, there are several areas for improvement.
  
  The manuscript's readability could be better. There are typos, textual errors, and inconsistencies in figure captions, such as incorrect figure references or mismatched values between the text and figures. The "Consistent Segmentation" section is difficult to follow. It is unclear whether this step involves merging pre-existing breakpoints in the data to produce new, longer segments or if larger segments, such as whole chromosomes, are split into smaller, constant-sized segments. The writing suggests that segments are first merged and then split; however, later in the manuscript, they appear to be used separately. In our testing, combining these approaches did not yield meaningful results. Since consistent segmentation is the method's most critical step, we strongly suggest clarifying this section.
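
The two readings of "consistent segmentation" that the reviewer distinguishes can be illustrated with a small sketch: splitting a chromosome into constant-size bins versus merging nearby pre-existing breakpoints into shared ones. This is only an illustration of the distinction; the function names, the single-linkage clustering rule, and the parameters are assumptions and do not represent CNSistent's API or its actual algorithm.

```python
def fixed_size_bins(chrom_length, bin_size):
    """Reading 1: split a chromosome into constant-size, half-open bins."""
    return [
        (start, min(start + bin_size, chrom_length))
        for start in range(0, chrom_length, bin_size)
    ]


def merge_breakpoints(breakpoints, max_gap):
    """Reading 2: cluster existing breakpoints that lie within `max_gap`
    of their neighbour and keep one representative (rounded mean) per cluster."""
    clusters = []
    for bp in sorted(breakpoints):
        if clusters and bp - clusters[-1][-1] <= max_gap:
            clusters[-1].append(bp)
        else:
            clusters.append([bp])
    return [round(sum(cluster) / len(cluster)) for cluster in clusters]


# Breakpoints pooled from several hypothetical samples on one chromosome.
pooled = [100, 120, 5000, 5010, 90000]
print(fixed_size_bins(2_500_000, 1_000_000))
print(merge_breakpoints(pooled, max_gap=50))
```

The first approach ignores where samples actually break, while the second depends entirely on the pooled data, which is why it matters whether the two are meant to be combined or used as alternatives.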
  
  The manuscript is unbalanced in its content, with excessive focus on the tool's application and the discoveries derived from it, rather than on the tool itself. This reduces the clarity of the key message. We recommend compressing the application section (deep learning in cancer classification) while expanding the tool description with additional explanations.
  
  It is also unclear what type of data the authors are using in the cancer classification section. To improve clarity, this information should be explicitly included in the methods section, detailing the sequencing strategy and copy-number tools used for each cohort.
  
  The methods section would benefit from a more detailed explanation of the CNSistent steps. Both Figure 1 and the text leave some parts unclear, particularly in the "Consistent Segmentation" section. Additionally, methods such as random forest and UMAP are only briefly mentioned in a supplementary figure rather than being described in the methods section. Moving these descriptions to the methods section would improve clarity.
  
  Figures are generally clear, but improving color differentiation would be beneficial. For example, in Figure 1, the dark red and dark orange shades are too similar, making them difficult to distinguish. A more optimized color scheme with slightly lighter tones (i.e., increased luminance) would enhance readability.
  
  The introduction promotes copy-number signatures; however, these signatures rely on segment lengths and unique breakpoints, which vary between samples. Since this method enforces consistent segmentation and breakpoints across all samples, its applicability to copy-number signatures is unclear. This should be discussed in the Discussion section or removed from the introduction.
  
  Out of curiosity: Is it possible to prioritize one type of segmentation over another? For instance, if both WGS and WES data are available, can CNSistent be configured to prioritize WGS calls? Similarly, some tools provide highly precise breakpoint calls that are valuable for detecting fusion genes or rearrangements. In such cases, it would be useful to prioritize these calls and harmonize results from other tools accordingly.
  
  Terminology Clarifications:
  
  Blacklist, blacklisted regions, gap regions, mask: These terms should be used consistently, particularly since blacklists can be applied at different processing stages. Notably, PCAWG blacklists samples, not regions.
  Segmentation: The term is commonly used in CNV analysis to refer to inferring continuous genomic segments from raw read counts or probe intensities. Here, it has a slightly different meaning (computing consistent breakpoints across all samples), so a more explicit definition would be helpful.
  Breakpoint merging/clustering: If these terms are synonymous, choosing one would improve readability.
  Coverage: Since "coverage" often refers to sequencing depth, a critical quality metric in DNA sequencing, it might be clearer to use "copy-number coverage" or a similar term. For example, the sentence "Next, samples with low coverage were removed using the…" could be ambiguous if read without context.
  
  At the end of the subsection "Explainability and the Effect of SOX2 Gene," the phrase "which exhibits significant local amplification in LUSC" should be revised to "which exhibits significant focal amplification in LUSC." The correct terminology is "focal" rather than "local," as established in Beroukhim et al. (2010).
2. GigaScience 30 Sep 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf104), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Ellen Visscher
  
  The paper introduces a Python package for imputation, filtering, segmentation, feature extraction and visualisation of CNA profiles. It explains some of the elements of the package, and then demonstrates how data from multiple cohorts can be processed and combined using the package's preprocessing pipeline. The authors then use processed data from 3 different cohorts to perform cancer type prediction using a CNN. From this, they obtain an interesting result: a biomarker that differentiates two different lung cancers. Throughout, they show visualisations using their package. The package itself seems well documented and designed to be used. Some clarification is required in the methods section, specifically around the CNN training and the models therein. There is also one major question of whether all the preprocessing steps are actually required for the downstream CNN analysis. Overall, however, this is a well-written manuscript, providing a useful software tool for further analysis of CNA data.
  
  Major comments:
  - CNN section: how are the segments decided? Is it based on all the training data, or just data in a batch?
  - Throughout the results pertaining to Figure 3A-C, you call it test accuracy. To be clear, is this based on your CV hold-outs? This should be reworded everywhere to reflect this. As cross-validation indicates, this is not a test set but a validation set, which is also the way you use it.
  - Regarding the above, you have a comment saying: "the best test accuracy without cross-validation was 92.34%". Could you please clarify what you mean by this? Only in the CNN section do you describe your training approach, which does not mention a test or separate validation set.
  - It reads slightly unclearly: you have a section called "model transfer", but are you training 3 different models, one per dataset? You only have one figure for training results, which suggests one dataset, but then you have this section called model transfer?
  - Re all the above, please dedicate a small subsection in methods to making this clearer. Are there dedicated test sets? If your main results are for aggregated data, then what are you testing on to ensure generalisability? What is the point of training the 3 different models on 3 different datasets? Perhaps it would make more sense to hold one dataset out as your test set. In some ways, that is what the model transfer is showing, but it would be less confusing to clarify that aim instead of suddenly introducing 3 models.
  - If the CNN architecture is essentially the same as in Attique et al., the performance is basically the same, and they use only CNs at gene locations, how does this demonstrate that the preprocessing from CNSistent is necessary or advantageous for this task? Maybe a result that combines CN calls naively over gene locations, compared against this across the aggregate datasets, would be a good way of comparing, i.e. showing that preprocessing does offer an advantage when combining different datasets together. This is also what you argue in your abstract. For this analysis you would have to make sure you also compare across the same samples to differentiate between filtering and other preprocessing steps.
  - In Figure 3I, you say "notice the similarity of chromosome 3 pattern for the correctly classified LUSC samples (red) and the misclassified ones (orange)". This is confusing because the orange and red are not similar. In fact, for this whole section, it seems that Figure 3I does not align with what you are saying?
  
  Minor comments/errors:
  - Please clarify why CNSistent needs a reference genome if it is dealing with segments. How is this information used? Is it just for the known gaps?
  - Your caption of Supplementary Figure 1 has a typo about a breakpoint at 16 instead of 14.
  - You do not explain how you use the knee point to filter (i.e., is it samples above or below the knee point?).
  - Your CNN graphic is difficult to interpret and non-standard.
  - The CNN section should clarify at the beginning what the input is and what the output is (i.e., a prediction that a sample belongs to a particular cancer type) before explaining the architectural details.
  - Even though you control for class imbalance, some cancer types are so poorly represented that it is unlikely a CNN could learn them. You do partly mention this in the discussion, but maybe some sort of minimum threshold for inclusion would make sense.
  - For Fig 2D you refer to it as GND, but the axes/title say hemizygosity. Are these things equivalent? E.g., a sample could be 3-3, with low hemizygosity but not diploid. Or, if it is aggregated across the whole genome, is it assumed equivalent?
  - There is a grammatical error: "Runtimes decreased in a near-linearly with the number of compute cores"
  - You make a comment that "We therefore suspect some TCGA lung cancers might be cases of co-occurring adeno and squamous carcinomas." This is a possibility, but given the pleiotropy of many phenotypes, it may also be that the biomarker is not always unique to squamous carcinomas.
  
  Suggestions/Nice to haves:
  - Maybe make it clearer inside the paper what visualisations come with CNSistent. Looking at the software documentation, there are obviously a lot of useful visualisations that come with it, and some of them you have used in Figure 3, for example.
  - Given there are more total CN callers, it may be good to mention somewhere how CNSistent would work for total CNs only.
  - You remove profiles that you say are uninformative. Could you not include these and then show how accuracy correlates with the number of breakpoints (for example)? In some ways one might think that there could be useful information in profiles with few alterations, because those alterations might be more upstream/causal.
  - The aggregation step could affect downstream analysis, i.e. taking the average could introduce CNs that were never called. Even using min/max implies a constant copy number in that region, which may lose information; e.g. if it is a functional region, having two different CNs across a gene might imply non-functionality. Did you explore the effect of the aggregation step? Perhaps taking a small enough resolution of segment types would account for this anyway.
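
The concern about aggregation in the last suggestion can be shown with a toy example: when a bin spans two segments with different copy numbers, a length-weighted mean yields a value that no caller reported, while min or max hides the change inside the bin. The function below is a hypothetical illustration and is not taken from CNSistent.

```python
def aggregate_bin(segments, bin_start, bin_end, how="mean"):
    """Aggregate copy numbers of half-open (start, end, cn) segments over one bin.

    Returns None if no segment overlaps the bin.
    """
    overlaps = []
    for start, end, cn in segments:
        length = min(end, bin_end) - max(start, bin_start)
        if length > 0:
            overlaps.append((length, cn))
    if not overlaps:
        return None
    if how == "mean":
        covered = sum(length for length, _ in overlaps)
        return sum(length * cn for length, cn in overlaps) / covered
    if how == "min":
        return min(cn for _, cn in overlaps)
    if how == "max":
        return max(cn for _, cn in overlaps)
    raise ValueError(f"unknown aggregation: {how}")


# A 1 Mb bin whose first half was called CN 2 and second half CN 4.
calls = [(0, 500_000, 2), (500_000, 1_000_000, 4)]
for how in ("mean", "min", "max"):
    print(how, aggregate_bin(calls, 0, 1_000_000, how))
```

Here the mean is 3, a copy number that appears in neither original call, which is the kind of artefact the reviewer asks the authors to examine.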
3. GigaScience 30 Sep 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf104), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Stefano Monti
  
  This is a well-written paper that aims to develop a tool that can integrate SCNAs from large datasets, possibly generated using different platforms, to identify alteration patterns that are often undetected in smaller data subsets. The authors have used a CNN-based method for integrating the data, extracting features and predicting cancer types from SCNA profiles. The tool has the potential to significantly simplify the integration and analysis of large-scale SCNA studies. However, some (hopefully addressable) weaknesses are noted:
  
  The choice of a classification task as the (only) way to evaluate the proposed method is questioned. I would argue that the most important use of SCNA detection is in support of mechanistic investigations, by identifying novel candidate loci likely to harbor tumor suppressors (copy losses) and oncogenes (copy gains). This type of analysis is hardly mentioned in the manuscript, and it is not clear how well the proposed tool would support it. I surmise it can, but the authors should discuss (and present results about) it.
  
  If we were to focus on the task of recurrent SCNA detection, then meta-analysis approaches (where separate analyses are performed on each of the datasets, and only the results are integrated) would need to be considered as an alternative to the approach here proposed (e.g., application of GISTIC to each of PCAWG, TCGA, TRACERx separately, followed by meta-analysis integration of the results). I am not saying meta-analysis would be superior, but the authors should discuss it, and possibly evaluate it.
  
  The reported metrics to quantify the quality of the integration are insufficient to assess the results. There is some lack of clarity about the classification accuracy results reported, since it is not clear whether all the components of the model building were adequately brought into the cross-validation (or train/test) loop. More specifically, when reporting the accuracy of the cancer type classification, it is reported that 1 megabase segmentation yields the best results. It is not clear if this size selection was performed within the train set only (and/or within the CV loop) or across the entire dataset. If the latter, this may significantly affect the accuracy results, which could not be deemed (unbiased) "test set" results. This should be clarified, and if the segment size selection was indeed performed outside the train/test split, accuracy measures should be computed again by performing the segment size selection properly (which of course would mean a potentially different size could be selected for each of the folds).
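
The procedure the reviewer asks for is commonly called nested cross-validation: the segment size is chosen using only the training portion of each outer fold, and the held-out fold is used once for the final score. The sketch below shows the control flow only. The `score` callback, `toy_score`, and its arbitrary accuracy values are placeholders introduced for illustration; they are not the authors' model or results.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle positions 0..n-1 and deal them into k disjoint folds."""
    positions = list(range(n))
    random.Random(seed).shuffle(positions)
    return [positions[i::k] for i in range(k)]


def nested_cv(n_samples, segment_sizes, score, k_outer=5, k_inner=3):
    """Select a segment size inside each outer fold, then score on held-out data.

    `score(size, train_idx, eval_idx)` is a hypothetical callback that trains a
    classifier on `train_idx` at the given segment size and returns its
    accuracy on `eval_idx`.
    """
    outer_scores, chosen_sizes = [], []
    for fold_no, test_idx in enumerate(k_fold_indices(n_samples, k_outer)):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        inner_folds = k_fold_indices(len(train_idx), k_inner, seed=fold_no + 1)

        def inner_mean_accuracy(size):
            # Uses training samples only; the outer test fold is never seen here.
            total = 0.0
            for val_positions in inner_folds:
                val_set = set(val_positions)
                inner_val = [train_idx[p] for p in val_positions]
                inner_train = [idx for p, idx in enumerate(train_idx) if p not in val_set]
                total += score(size, inner_train, inner_val)
            return total / k_inner

        best_size = max(segment_sizes, key=inner_mean_accuracy)
        chosen_sizes.append(best_size)
        outer_scores.append(score(best_size, train_idx, test_idx))
    return outer_scores, chosen_sizes


def toy_score(size, train_idx, eval_idx):
    """Placeholder scorer with arbitrary values; a real one would train a model."""
    assert not set(train_idx) & set(eval_idx), "train and eval sets must be disjoint"
    return {250_000: 0.80, 1_000_000: 0.90, 5_000_000: 0.70}[size]


scores, sizes = nested_cv(60, [250_000, 1_000_000, 5_000_000], toy_score)
print(sizes, sum(scores) / len(scores))
```

With a real scorer, the selected size may differ between outer folds, which is exactly the behaviour the reviewer notes in the final parenthesis.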
  
  Comparisons with other methods: The authors only compare their method to random forest (RF). Related to the previous point: I presume the RF model used the segment size that was optimized for the CNN model (i.e., 1Mb). If this is the case, it would be an unfair comparison, since RF might favor a different size. Also, additional classifiers should be evaluated (e.g., Elastic Net, SVM, etc.).
  
  There is insufficient discussion of existing tools/methods. This should be corrected (see also my comment about meta-analysis approaches).
  
  Metadata effects: Age influences copy number alterations. The authors do not consider age or any other metadata, or their implications for the classification task.
  
  Run time statistics and user requirement: While the authors report runtime curves per command (S Fig 6), it is difficult to translate this to total runtime. It would be useful if runtime for the entire training of a model were reported. Additionally, if available, comparison of run time stats with the established model that they cite would be useful.
  
  IG-based explanation. I found this section sort of perfunctory, not sufficiently justified, and adding little to the manuscript. IG is computationally expensive, and it does not provide any way to statistically quantify the found associations. Simpler methods, such as testing for association between SCNA occurrence and cancer type should be evaluated and compared to.
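
  As a sketch of the simpler association testing suggested above, a two-sided Fisher exact test can be run on a 2x2 table of SCNA presence against membership in one cancer type. The counts below are hypothetical, and in practice the p-values would need multiple-testing correction across loci.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Rows: samples of the cancer type of interest vs. all other samples.
    Columns: SCNA present vs. SCNA absent at one locus.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Hypothetical counts: 8 of 10 samples of one type carry the SCNA, 1 of 6 others do.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```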
  
  Model selection: No adequate justification of why they picked CNN for this task when the referenced paper itself claims the DNN architecture performs better. Not sure but is this because of the varying segment size? Again, this is not clearly stated. https://pmc.ncbi.nlm.nih.gov/articles/PMC9203194/#tab1

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.12.23.630118v1
www.biorxiv.org www.biorxiv.org

Using synthetic RNA to benchmark poly(A) length inference from direct RNA sequencing

2
1. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Abstract: Polyadenylation is a dynamic process which is important in cellular physiology. Oxford Nanopore Technologies direct RNA-sequencing provides a strategy for sequencing the full-length RNA molecule and analysis of the transcriptome and epi-transcriptome. There are currently several tools available for poly(A) tail-length estimation, including well-established tools such as tailfindr and nanopolish, as well as two more recent deep learning models: Dorado and BoostNano. However, there has been limited benchmarking of the accuracy of these tools against gold-standard datasets. In this paper we evaluate four poly(A) estimation tools using synthetic RNA standards (Sequins), which have known poly(A) tail-lengths and provide a valuable approach to measuring the accuracy of poly(A) tail-length estimation. All four tools generate mean tail-length estimates which lie within 12% of the correct value. Overall, Dorado is recommended as the preferred approach due to its relatively fast run times, low coefficient of variation and ease of use with integration with base-calling.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf098), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Jesse Daniel Brown
  
  This manuscript addresses a relevant and timely question: benchmarking poly(A) tail-length estimation tools (BoostNano, tailfindr, nanopolish, and Dorado) using synthetic RNA standards (Sequins) with known tail lengths. Poly(A) tail-length estimation is increasingly important for understanding mRNA stability, processing, and regulation at the single-molecule level. As direct RNA sequencing expands in use, reliable methods to measure poly(A) tail lengths are needed. The study's design, which leverages Sequins as a "gold standard" to benchmark tools, is strong and fills a need in the current literature. The analysis is thorough in its basic comparisons, and the results are likely to be useful to researchers who need to choose suitable software for poly(A) tail analysis. However, the manuscript would benefit from deeper contextualization, more rigorous statistical methodology, and clearer reporting of computational details. Ensuring reproducibility and providing clearer guidance on interpreting the results in real biological contexts would strengthen the manuscript. The suggestions below are aimed at making the study more valuable to the community. For this reason, my recommendation is: Revisions ARE Needed.
  
  Introduction
  
  Abstract: ★★★★☆ (4/5) (rated in place of the introduction). Strengths: The introduction adequately outlines why polyadenylation is biologically important and why direct RNA sequencing provides a unique opportunity for poly(A) tail-length estimation. It justifies the use of Sequins as synthetic standards, which is a robust approach to derive ground-truth tail lengths.
  
  Areas for Improvement: The introduction could better connect poly(A) tail-length estimation to downstream applications. For instance, mention how accurate tail-length estimation could improve understanding of mRNA decay rates, translation efficiency, or isoform-specific regulation.
  
  Adding references that contextualize poly(A) tail dynamics in broader biological phenomena would help readers understand the significance. For example, it is almost a necessity to cite work such as "Roles of mRNA poly(A) tails in regulation of eukaryotic gene expression" by Lori A. Passmore & Jeff Coller (2022, Nature Reviews Molecular Cell Biology), which provides a comprehensive analysis of poly(A) tail dynamics and their impact on mRNA decay, stability, and translation regulation. Passmore & Coller (2022) also expand on these principles by discussing the mechanistic underpinnings of poly(A)-mediated decay and translation regulation, making it a broader and more recent contribution to polyadenylation biology, which the authors should consider.
  
  Grammar of the abstract: Error: "There are currently several tools available for poly(A) tail-length estimation, including well-established tools such as tailfindr and nanopolish, as well as two more recent deep learning models: Dorado and BoostNano." Suggestion: "Several tools are currently available for poly(A) tail-length estimation, including well-established methods like tailfindr and nanopolish, as well as two more recent deep learning models: Dorado and BoostNano."
  
  Error: "which lie within 12% of the correct value." Suggestion: "that lie within 12% of the correct value."
  
  Clarify the library preparation steps to avoid confusion about the "direct" nature of RNA sequencing. The text currently implies that no reverse transcription is required, but then references an ONT Reverse Transcription Adapter. Distinguish between a full-length cDNA synthesis step (not required) and the use of a poly(T)-containing adapter for sequencing library preparation.
  
  Methods
  
  Methods: ★★★★☆ (4/5) The methods section has its strengths; the data sources and preparation (Sequins spiked into host RNA) are clearly described. Versions of tools are provided, enhancing reproducibility.
  
  Areas for Improvement are statistical analysis, comparisons and tests, hardware and computation details, and understanding of run time differences. Currently, the study models distributions as normal and uses mean and SD, but no normality tests or justification for these choices are presented. Consider performing normality tests or using nonparametric measures. Additionally, providing confidence intervals or other robust statistics (median, interquartile ranges) would clarify variability.
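
  A minimal sketch of the robust summaries and confidence intervals suggested above, using a percentile bootstrap. The tail-length values are hypothetical, and the percentile bootstrap is only one of several ways to obtain an interval.

```python
import random
from statistics import mean, median, quantiles

def bootstrap_ci(values, stat=mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `values`."""
    rng = random.Random(seed)
    boots = sorted(stat(rng.choices(values, k=len(values))) for _ in range(n_boot))
    lower = boots[int(alpha / 2 * n_boot)]
    upper = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical per-read tail-length estimates (nt) for a 30 nt standard.
tail_lengths = [27, 29, 30, 30, 31, 28, 33, 30, 29, 32, 8, 31]

ci_low, ci_high = bootstrap_ci(tail_lengths)
q1, q2, q3 = quantiles(tail_lengths, n=4)  # quartiles; q2 is the median
iqr = q3 - q1
```

  Passing `stat=median` gives an interval for the median instead, which is less sensitive to short-tail reads such as the single 8 nt value in this toy sample.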
  
  For the comparisons and tests, the authors should explain why they chose root mean square error (RMSE) minimization and other metrics. Could alternative tests, like Wilcoxon signed-rank tests or paired t-tests, be used to compare the distribution of tail-length estimates more rigorously? The Wilcoxon signed-rank test is non-parametric and suitable for paired comparisons when the assumption of normality is not met; it is useful for comparing the predicted tail lengths from each tool against the expected lengths, especially if the data distribution is skewed. The paired t-test could be applied if the normality assumption holds, providing a straightforward way to assess whether the mean difference between predicted and expected values is statistically significant. Justification should be provided for why these tests were or were not used.
  
  There are some additional metrics to explore:
  --- Median Absolute Deviation (MAD): Consider adding MAD as it is robust to outliers and could complement RMSE to provide a better understanding of central tendencies and variability.
  --- Mean Absolute Error (MAE): MAE is another alternative that simplifies the interpretation by focusing solely on the magnitude of errors without squaring them, potentially offering more intuitive insights for readers.
  
  The authors should address testing for normality, explicitly stating whether normality tests were conducted on the data (e.g., Shapiro-Wilk or Kolmogorov-Smirnov tests). If normality is confirmed, justify the use of parametric tests like RMSE or t-tests. If not, justify why non-parametric tests (e.g., Wilcoxon) were not employed or discuss plans to include them in future studies.
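
  The three error measures discussed above can be compared side by side with a short sketch. The estimates are hypothetical and include one outlier read to show how differently the measures react to it.

```python
from math import sqrt
from statistics import median

def error_metrics(estimates, expected):
    """Return RMSE, MAE, and MAD for estimates of one known tail length."""
    errors = [e - expected for e in estimates]
    rmse = sqrt(sum(err ** 2 for err in errors) / len(errors))
    mae = sum(abs(err) for err in errors) / len(errors)
    center = median(estimates)
    mad = median(abs(e - center) for e in estimates)  # spread around the median
    return {"rmse": rmse, "mae": mae, "mad": mad}

# Hypothetical estimates (nt) for a 30 nt standard, with one outlier read.
metrics = error_metrics([28, 30, 32, 30, 60], expected=30)
```

  In this toy sample the single 60 nt read dominates the RMSE, raises the MAE moderately, and leaves the MAD at 2 nt.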
  
  Explain the choice of statistical methods over time by discussing how the choice of statistical tests aligns with the study's goals. For example, emphasize whether the focus was on understanding overall error distribution, tool consistency, or accuracy in predicting specific tail lengths.
  
  The authors could complement the statistical tests with visual representations of error, such as boxplots, violin plots, or Bland-Altman plots, to illustrate the error distributions and discrepancies between predicted and actual tail lengths across tools.
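
  The numbers behind a Bland-Altman plot (the bias and the 95% limits of agreement) can be computed without any plotting library. This is a sketch with hypothetical paired values; the plot itself would show each difference against the mean of the pair, with horizontal lines at these three values.

```python
from statistics import mean, stdev

def bland_altman(predicted, reference):
    """Bias and 95% limits of agreement between paired measurements."""
    diffs = [p - r for p, r in zip(predicted, reference)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of the differences
    return {"bias": bias, "lower": bias - spread, "upper": bias + spread}

# Hypothetical paired values: tool estimates vs. known tail lengths (nt).
stats = bland_altman(predicted=[31, 29, 33, 27, 62, 58], reference=[30, 30, 30, 30, 60, 60])
```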
  
  The authors should provide explicit details on the computational environment (CPU/GPU models, RAM, OS) for each tool's run. While the GitHub README explains how to run the system, it lacks any details about system requirements. Readers need this to understand runtime differences and attempt to replicate performance measurements.
  
  The authors should consider tool parameterization and indicate if any specific parameters (beyond defaults) were used in tailfindr, nanopolish, Dorado, or BoostNano runs. If no changes were made from defaults, state this explicitly.
  
  Results
  
  The strengths of the results are that they are presented clearly, showing density distributions and discussing short-tail anomalies. The identification of Dorado as a preferred tool due to speed, integration, and conservative filtering is well-supported by the data. The study acknowledges that all tools achieve broadly similar accuracy, differing mainly in runtime and filtering criteria, which is a practical insight for users.
  
  The results have areas for improvement:
  
  Short-tail reads explanation: The authors attribute short (<10 nt) poly(A) tails to truncated transcripts or mis-priming. It is suggested that the authors strengthen this discussion with additional evidence or reasoning. For instance, is there a correlation between read quality and short-tail length estimates? Do truncated reads consistently align to internal A-rich stretches?
  
  Multiple peaks in distributions: Some density plots (Figure 1) show multiple peaks or shoulder peaks. Discuss potential reasons for these patterns. Are they related to tool-specific biases, read quality, or adapter/poly(T) truncation?
  
  Application context: The results focus on method performance, but it would help readers to understand how these differences might influence downstream tasks. For example, if a method overestimates poly(A) length slightly, how could this affect conclusions about RNA stability or differential tail-length analysis between experimental conditions?
  
  Figures and tables:
  Figure 1: Clear density plots, but consider adding vertical lines at expected tail lengths (30 nt and 60 nt) to guide interpretation. Splitting the figure into separate panels for R1 and R2 or using insets might clarify multiple peaks.
  Figure 2: The IGV snapshots are informative. Enhance interpretability by adding annotations (arrows or boxes) highlighting truncated vs. full-length reads. Increase font sizes for readability.
  Figure 3: Useful comparison of reads filtered by Dorado but retained by BoostNano. Add a brief note or labeling to indicate expected tail lengths. Discuss possible reasons for Dorado's conservative filtering here or in the main text.
  Tables: Provide definitions for abbreviations (nt, CPU, GPU) in captions. For Table 2, adding confidence intervals around the mean tail-length estimates would strengthen statistical rigor. For Table 3, specify hardware details as recommended above.
  
  Grammar Mistakes and errors in the results section: Results Section: Sentence: "The four methods display a similar pattern in the density distribution, with a prominent normal-like peak near the expected poly(A) length, but also with a over-representation of shorter poly(A) tails, ranging at approximately ~0-10 nt (Figure 1)." Issue: "a over-representation" Correction: "an over-representation"
  
  Sentence: "We expected that these shorter peaks were derived from either fragmentation of the transcript, mis-priming of internal poly(A) stretches or degradation of the poly(A) tails." Issue: tense mismatch ("expected" vs. "were derived"). Correction: pairing "We expect" with "were derived" loses context and tense conformity, so the sentence should be adjusted: "We hypothesize that these shorter peaks are derived from either fragmentation of the transcript, mis-priming of internal poly(A) stretches, or degradation of the poly(A) tails."
  
  Sentence: "Interestingly, upon investigating these earlier peaks, we found that Dorado excludes reads which are retained in the analysis by BoostNano, despite them being classified as passed reads (Figure 3)." Issue: Ambiguous pronoun "them." (them could incorrectly identify three possible targets in the sentence) Correction: "Interestingly, upon investigating these earlier peaks, we found that Dorado excludes reads retained in the analysis by BoostNano, even though these reads are classified as passed reads (Figure 3)."
  
  Sentence: "Therefore, Dorado appears to be a more conservative approach than BoostNano." Issue: No grammar issues, but the statement could be more precise. Suggested improvement: "Thus, Dorado demonstrates a more conservative approach compared to BoostNano."
  
  Sentence: "In order to determine which normal distribution fit the peak best, we found the parameters (mean, SD) which minimize the root mean square error between the candidate normal distribution and the density distribution for an interval of 10 nt to the right of the mode." Issue: Verb tense consistency ("fit"). Correction: "To determine which normal distribution fits the peak best, ..."
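
  The fitting procedure described in the quoted sentence can be sketched as a grid search over (mean, SD) that minimizes the RMSE against the density on the 10 nt interval to the right of the mode. The grid ranges and the synthetic density are assumptions for illustration; the optimizer and density estimate used in the manuscript are not specified here.

```python
from math import sqrt
from statistics import NormalDist

def fit_normal_right_of_mode(xs, density, window=10):
    """Grid-search (mean, SD) minimizing RMSE on [mode, mode + window]."""
    mode = xs[density.index(max(density))]
    points = [(x, d) for x, d in zip(xs, density) if mode <= x <= mode + window]

    def rmse(mu, sd):
        model = NormalDist(mu, sd)
        return sqrt(sum((model.pdf(x) - d) ** 2 for x, d in points) / len(points))

    means = [mode - 5 + 0.5 * i for i in range(21)]  # assumed range: mode +/- 5 nt
    sds = [1 + 0.5 * i for i in range(19)]           # assumed range: 1 to 10 nt
    return min(((mu, sd) for mu in means for sd in sds), key=lambda p: rmse(*p))

# Synthetic density from a known normal, so the fit can be checked.
xs = list(range(0, 61))
density = [NormalDist(30, 3).pdf(x) for x in xs]
best_mean, best_sd = fit_normal_right_of_mode(xs, density)
```

  Fitting only to the right of the mode keeps the excess of short tails on the left from distorting the estimated peak.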
  
  Sentence: "The peaks also lose their normal-like behavior for larger values." Issue: Could use a more formal tone. Correction: "The peaks also deviate from their normal-like behavior at larger values."
  
  Sentence: "Next, we compared the computational time required by each method to predict the tail-length of 4000 reads." Issue: Hyphenation of "tail-length." Correction: "Next, we compared the computational time required by each method to predict the tail length of 4,000 reads."
  
  Sentence: "BoostNano also offers the option of using the Application Programming Interface (API) call instead of the direct method, which omits the file copy implemented in the direct approach, reducing the run time to 8 m 8 s." Here, the sentence is extremely overwritten, which causes a lack of clarity. Correction: "BoostNano offers an alternative API-based method, which skips the file copy step of the direct approach, reducing the runtime to 8 minutes and 8 seconds."
  
  Discussion
  
  Discussion: ★★★☆☆ (3/5) The discussion has its strengths, as it correctly identifies that Dorado's advantages (speed, integration with basecalling) make it appealing as a default choice. The authors acknowledge that all tools are within a similar accuracy range, suggesting the deciding factor may be speed or integration rather than raw performance differences. However, there are areas for improvement: Further dissect the limitations of each tool. For example, BoostNano shows good SD but slightly off mean for R1; what does this mean for its use cases? Address the discrepancy between tailfindr, nanopolish, and Dorado in terms of how they define and detect poly(A) boundaries. Why does Dorado not evaluate start/end positions of poly(A) tails in event space, and how might this influence results? Include a brief discussion about how results might generalize to more complex transcriptomes. Real samples have varying GC content, fragment lengths, and potentially modified bases. A short commentary acknowledging these factors would show awareness that synthetic standards cannot capture the full complexity of natural RNA populations. For these reasons, it is suggested that the authors propose future directions. For instance, how could tool developers incorporate these findings to improve their methods? Could future benchmarking sets include a gradient of tail lengths to better understand length-specific biases?
  
  Grammar Mistakes and errors in the discussion section: Sentence: "BoostNano and tailfindr tools provided estimation of the starting and ending positions of the poly(A) tails in event space while this information was absent in Dorado outputs." Issue: "provided estimation" should be "provide estimation" to align with present tense. Correction: "BoostNano and tailfindr tools provide estimation of the starting and ending positions of the poly(A) tails in event space, while this information is absent in Dorado outputs."
  
  Sentence: "On the R1 dataset, BoostNano showed a tighter distribution with the smallest SD, but its peak was the furthest from the correct value." The issue here is that the test results are still speaking about general truths, leading to verb tense inconsistency; "showed" should match other verbs in the section. Correction: "On the R1 dataset, BoostNano shows a tighter distribution with the smallest SD, but its peak is the furthest from the correct value."
  
  Sentence: "tailfindr had the most accurate estimation but also the largest error interval."
  
  The issue here is the verb tense mismatch; "had" should be consistent with present tense to show truth, not past truth. Correction: "tailfindr has the most accurate estimation but also the largest error interval."
  
  Sentence: "Furthermore, Boostnano is more lenient in keeping reads for poly(A) estimation than Dorado."
  
  Issue: "Boostnano" capitalization error; it should be "BoostNano." Correction: "Furthermore, BoostNano is more lenient in keeping reads for poly(A) estimation than Dorado."
  
  Sentence: "Overall, our results suggest that the four tools investigated in this study - BoostNano, tailfindr, nanopolish and Dorado have similar performance with their accuracy varying from one dataset to the other, with a potential length bias."
  
  Issue: Missing commas for clarity; replace "with their accuracy varying from one dataset to the other" for conciseness. Correction: "Overall, our results suggest that the four tools investigated in this study—BoostNano, tailfindr, nanopolish, and Dorado—have similar performance, with accuracy varying across datasets and showing potential length bias."
  
  Sentence: "Therefore, we expect Dorado to be implemented as the default method of poly(A) tail estimation in the near future, with the rapid estimation timeframe, comparable estimation lengths to other tools, conservative nature and the added benefit of ease of obtaining this information during basecalling."
  
  There are several issues here including verbosity and lack of parallelism. Correction: "Therefore, we expect Dorado to be implemented as the default method for poly(A) tail estimation, given its rapid estimation timeframe, comparable accuracy to other tools, conservative nature, and ease of integration with basecalling."
  
  Sentence: "This work demonstrates the value of having access to synthetic RNA molecules with known poly(A) tail-lengths for validating the accuracy of poly(A) tail estimation algorithms."
  
  Issue: The phrase "validating the accuracy of" could be simplified for readability. Correction: "This work demonstrates the value of synthetic RNA molecules with known poly(A) tail lengths for validating poly(A) tail estimation algorithms."
  
  Sentence: "As methods improve, we anticipate that these datasets will be valuable for assessing improvements in estimation of poly(A) tails."
  
  Issue: "improvements in estimation of" is awkward. Correction: "As methods improve, we anticipate that these datasets will be valuable for assessing advancements in poly(A) tail estimation."
  
  References need to be added to accommodate the suggested material review, but existing references are good.
  
  NEEDS REVISION Jesse Daniel Brown PD AASU
  
  Note:
  
  I previously reviewed this paper in Research Hub and you can read these comments via the Research Hub review page here: https://www.researchhub.com/paper/8634403/using-synthetic-rna-to-benchmark-polya-length-inference-from-direct-rna-sequencing/reviews#threadId=55398.
  
  The original preprint linked to the Research Hub review is here: https://doi.org/10.1101/2024.10.25.620206
2. GigaScience 30 Sep 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf098), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Christoph Dieterich
  
  In this manuscript, the authors present a benchmark to assess the performance of different tools designed for estimation of polyA tail length from Nanopore direct RNA-sequencing data. These tools include tailfindr, nanopolish, Dorado and BoostNano. Benchmarks on tools and algorithms to analyze Nanopore data, both third party tools and official ONT releases, are of utmost importance for the field. The use of synthetic constructs with known ground truth is recommended as well. Consequently, this study has the potential to provide a significant contribution to the field.
  
  In the current form, I can however not recommend it for publication in GigaScience. My major concerns are: a) Use of only RNA002 data. This chemistry is outdated and thus the Benchmark is only relevant for old, possibly already published data. A comprehensive Benchmark should also include RNA004 and available tools there (at least Dorado). b) The current data set only contains two polyA tail lengths, which are relatively short and do not cover longer polyA tails that are common e.g. in mammalian cells. A proper Benchmark should show the performance of the analyzed tools over a range of polyA tail lengths.
  
  Minor comments:
  1) Abstract: "All four tools generate mean tail-length estimates which lie within 13% of the correct value." The value of 13% is given in the Abstract from the submission system, whereas the abstract in the Main text says 12%. Which value is correct?
  2) Background, first paragraph: the role of the polyA tail in RNA circularization, which is required for efficient translation of cellular mRNAs, is not mentioned. Reference is missing for "is increasingly recognised as a dynamic process which influences timing and degree of protein production."
  3) Background, second paragraph: Chiron seems to be a relatively old basecaller (no models for new chemistries). It should be mentioned here that it is required for BoostNano.
  4) Mis-priming of internal polyA sites may be an important confounding (and currently overlooked) source of errors in Nanopore sequencing. This should be quantified properly and analyzed in more detail (length of these stretches, influence of other nucleotides within the A-rich stretch, etc.). Should be done as well on whole transcriptome data with more possible mispriming sites.
  5) Why do the authors think that the poly(T) stretch of the RTA might be truncated? This is composed of DNA oligos, which should be quite stable.
  6) What are the parameters for filtering used by Dorado and BoostNano? Can the authors explain why the filtered reads differ?
  7) Dorado seems to systematically underestimate polyA tail length. Is this true also for data generated with RNA004 chemistry and longer polyA tails?

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.10.25.620206v1
www.biorxiv.org www.biorxiv.org

Nanopore- and AI-empowered microbial viability inference

2
1. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Abstract: The ability to differentiate between viable and dead microorganisms in metagenomic data is crucial for various microbial inferences, ranging from assessing ecosystem functions of environmental microbiomes to inferring the virulence of potential pathogens from metagenomic analysis. While established viability-resolved genomic approaches are labor-intensive as well as biased and lacking in sensitivity, we here introduce a new fully computational framework that leverages nanopore sequencing technology to assess microbial viability directly from freely available nanopore signal data. Our approach utilizes deep neural networks to learn features from such raw nanopore signal data that can distinguish DNA from viable and dead microorganisms in a controlled experimental setting of UV-induced Escherichia cell death. The application of explainable AI tools then allows us to pinpoint the signal patterns in the nanopore raw data that allow the model to make viability predictions at high accuracy. Using the model predictions as well as explainable AI, we show that our framework can be leveraged in a real-world application to estimate the viability of obligate intracellular Chlamydia, where traditional culture-based methods suffer from inherently high false negative rates. This application shows that our viability model captures predictive patterns in the nanopore signal that can be utilized to predict viability across taxonomic boundaries. We finally show the limits of our model’s generalizability through antibiotic exposure of a simple mock microbial community, where a new model specific to the killing method had to be trained to obtain accurate viability predictions.
While the potential of our computational framework’s generalizability and applicability to metagenomic studies needs to be assessed in more detail, we here demonstrate for the first time the analysis of freely available nanopore signal data to infer the viability of microorganisms, with many potential applications in environmental, veterinary, and clinical settings.
Author summary: Metagenomics investigates the entirety of DNA isolated from an environment or a sample to holistically understand microbial diversity in terms of known and newly discovered microorganisms and their ecosystem functions. Unlike traditional culturing of microorganisms, genomic approaches are not able to differentiate between viable and dead microorganisms since DNA might persist under different environmental circumstances. The viability of microorganisms is, however, of importance when making inferences about a microorganism’s metabolic potential, a pathogen’s virulence, or an entire microbiome’s impact on its environment. As existing viability-resolved genomic approaches are labor-intensive, expensive, and lack sensitivity, we here investigate our hypothesis if freely available nanopore sequencing signal data that captures DNA molecule information beyond the DNA sequence might be leveraged to infer such viability. This hypothesis assumes that DNA from dead microorganisms accumulates certain damage signatures that reflect microbial viability and can be read from nanopore signal data using fully computational frameworks. We here show first evidence that such a computational framework might be feasible by training a deep model on controlled experimental data to predict viability at high accuracy, exploring what the model has learned, and using it in a real-world application by application to a bacterial species of veterinary relevance. We finally show that a specific model has to be trained to accurately predict viability after antibiotic exposure of a mock microbial community.
While the generalizability of our computational framework therefore needs to be assessed in much more detail, we here demonstrate that freely available data might be usable for relevant viability inferences in environmental, veterinary, and clinical settings.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf100), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Jakob Wirbel
  
  Summary: Urel and colleagues present a novel computational method to predict viability from metagenomic sequencing data, using the Nanopore squiggle as input. The manuscript is well-written and presents an interesting new application, bolstered in particular by the application of explainable AI. However, I have some concerns regarding the generalizability of their method, detailed below.
  
  Major: The way the authors try to exclude contamination in their C. abortus experiment is not optimal, since contaminants might be at low abundance and therefore not assemble well (especially with the relatively low sequencing output overall). Instead, it would be better to map reads against the reference genome for C. abortus and check if reads predicted to be viable map or if they are unmapped in this test. Maybe viable reads instead map against a database of known contaminants, like skin-resident microbes or other known kit contaminants. (This could potentially bolster their model performance.)
  
  The authors claim that their method generalizes well from E. coli to C. abortus, which were killed in two different ways (UV and heat shock). However, if I understood correctly, their extracted DNA was left in the lab for 5 days. During this time, could exposure to sunlight over time have led to similar chemical reactions (meaning twists/kinks in the DNA as well as pyrimidine dimers)? This might be a point to discuss or it could be easily tested by incubating the DNA of the heat-killed C. abortus in the dark.
  
  What is the time-frame of DNA degradation in which the model works best? The authors left the DNA for 5 days, but metagenomic samples are usually processed quite quickly. How would the model perform on samples that were only kept for 1 day after initial killing? At which time of incubation does the model not generalize anymore? For a potential application, it might be useful to know if DNA is viable or not, even if the cells died relatively recently (and in the dark).
  
  Code availability: The github looks great, but as a potential user of their method, I would not want to train my own model. Is it possible to host the model, maybe on Zenodo, so that it could be more useful as an application?
  
  Minor:
  Lines 96-100 read a bit like a Nanopore commercial and are not really relevant for this paper.
  Line 182: Shouldn't heat shock at 120 °C inactivate enzymes?
  Line 206: It is curious to keep the default cutoff just because the results are fine. Why not optimize the F1 score, for example? Fig1B seems to indicate that a probability threshold of 0.48 or something would give a higher F1 score. The decision to keep the threshold at the default value seems arbitrary.
  Line 275: Interesting hypothesis. Did you observe quicker decay of pore viability in the dead versus the alive run? Could you provide the pore scan information over the time of the sequencing run as a supplement, maybe, to back up this hypothesis?
  Line 311: The number does not match the one in the table.
  Line 331: The dead reads are very short. Could you compare just the length of the reads with the viability predictions? Are shorter reads more likely to be predicted to be non-viable?
  Fig 3a: What does normalized count mean? How about a standard histogram or density plot?
  Line 442: The most recent version of dorado is v0.8.2; did you mean v0.4.2? Please adjust.
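
The Line 206 comment about the probability cutoff can be made concrete with a minimal sketch of choosing a threshold by maximizing F1 on validation data. The scores and labels below are invented for illustration and are not the authors' data; `f1_at` and `best_f1_threshold` are hypothetical helper names.

```python
def f1_at(threshold, probs, labels):
    """F1 for the positive class when predicting 1 for prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(probs, labels):
    """Try every observed score as a cutoff; return (threshold, F1) of the best."""
    candidates = sorted(set(probs))
    return max(((t, f1_at(t, probs, labels)) for t in candidates),
               key=lambda pair: pair[1])


# Toy validation scores (hypothetical); label 1 = chunk from the killed culture.
probs = [0.10, 0.30, 0.45, 0.48, 0.52, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 1, 1, 0, 1, 1]

threshold, f1 = best_f1_threshold(probs, labels)
default_f1 = f1_at(0.5, probs, labels)
print(f"best cutoff {threshold:.2f} (F1 {f1:.3f}) vs default 0.50 (F1 {default_f1:.3f})")
```

A cutoff tuned this way should be selected on the validation split only and then reported on the untouched test split, otherwise the reported F1 is optimistic.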
2. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Abstract
  The ability to differentiate between viable and dead microorganisms in metagenomic data is crucial for various microbial inferences, ranging from assessing ecosystem functions of environmental microbiomes to inferring the virulence of potential pathogens from metagenomic analysis. While established viability-resolved genomic approaches are labor-intensive as well as biased and lacking in sensitivity, we here introduce a new fully computational framework that leverages nanopore sequencing technology to assess microbial viability directly from freely available nanopore signal data. Our approach utilizes deep neural networks to learn features from such raw nanopore signal data that can distinguish DNA from viable and dead microorganisms in a controlled experimental setting of UV-induced Escherichia cell death. The application of explainable AI tools then allows us to pinpoint the signal patterns in the nanopore raw data that allow the model to make viability predictions at high accuracy. Using the model predictions as well as explainable AI, we show that our framework can be leveraged in a real-world application to estimate the viability of obligate intracellular Chlamydia, where traditional culture-based methods suffer from inherently high false negative rates. This application shows that our viability model captures predictive patterns in the nanopore signal that can be utilized to predict viability across taxonomic boundaries. We finally show the limits of our model’s generalizability through antibiotic exposure of a simple mock microbial community, where a new model specific to the killing method had to be trained to obtain accurate viability predictions. While the potential of our computational framework’s generalizability and applicability to metagenomic studies needs to be assessed in more detail, we here demonstrate for the first time the analysis of freely available nanopore signal data to infer the viability of microorganisms, with many potential applications in environmental, veterinary, and clinical settings.
  
  Author summary
  Metagenomics investigates the entirety of DNA isolated from an environment or a sample to holistically understand microbial diversity in terms of known and newly discovered microorganisms and their ecosystem functions. Unlike traditional culturing of microorganisms, genomic approaches are not able to differentiate between viable and dead microorganisms since DNA might persist under different environmental circumstances. The viability of microorganisms is, however, of importance when making inferences about a microorganism’s metabolic potential, a pathogen’s virulence, or an entire microbiome’s impact on its environment. As existing viability-resolved genomic approaches are labor-intensive, expensive, and lack sensitivity, we here investigate our hypothesis if freely available nanopore sequencing signal data that captures DNA molecule information beyond the DNA sequence might be leveraged to infer such viability. This hypothesis assumes that DNA from dead microorganisms accumulates certain damage signatures that reflect microbial viability and can be read from nanopore signal data using fully computational frameworks. We here show first evidence that such a computational framework might be feasible by training a deep model on controlled experimental data to predict viability at high accuracy, exploring what the model has learned, and using it in a real-world application by application to a bacterial species of veterinary relevance. We finally show that a specific model has to be trained to accurately predict viability after antibiotic exposure of a mock microbial community. While the generalizability of our computational framework therefore needs to be assessed in much more detail, we here demonstrate that freely available data might be usable for relevant viability inferences in environmental, veterinary, and clinical settings.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf100), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Finlay Maguire
  
  In this paper the authors train a ResNet-based model to predict whether individual 10,000 sample chunks of nanopore signal data originate from live or killed bacterial isolate cultures. From live and UV-killed (at exponential phase) E. coli K-12 cultures DNA was extracted and sequenced using separate R10.4.1 flowcells on a MinION. Signal data from each read in the live and dead extractions were then processed by discarding the first 1,500 samples and dividing the remaining signals into 10,000 sample chunks. These were then split into a balanced 60:20:20 train, test, and validation datasets with the constraint that no two chunks from the same read would end up in the same dataset (e.g., chunk 1 and chunk 2 of 1st read in the killed culture would hypothetically be separated into train and test). During this they also explored/compared the impact of chunk size, model architecture, and performance of a sequence based model using the E. coli data. With a nicely performed class-activation map and masking approach they then identified the signal regions most strongly associated with dead-predictions (such as twisting/kinking/pore blockage of DNA around pyrimidine dimers). Finally, they applied their trained model to a live and heat-killed Chlamydia abortus culture and compared their results to stained microscopy and propidium monoazide PCR measures of viability. They found equivalent performance on the C. abortus data to their E. coli data (despite a different killing-method and taxa).
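
The preprocessing described above (discard the first 1,500 samples, then cut the remainder into 10,000-sample chunks) can be sketched as follows. This is an illustration only, not the authors' code: the function name `chunk_signal` is hypothetical, and dropping a trailing partial chunk is an assumption, since the review does not say how leftovers are handled.

```python
def chunk_signal(signal, skip=1500, chunk_size=10000):
    """Drop the first `skip` samples, then cut the rest into full chunks.

    Assumption: a trailing partial chunk is discarded; the review does not
    state how the authors treat it.
    """
    trimmed = signal[skip:]
    n_full = len(trimmed) // chunk_size
    return [trimmed[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]


# A synthetic "read" of 36,500 samples (placeholder values, not real signal).
read_signal = list(range(36500))
chunks = chunk_signal(read_signal)
print(len(chunks), "chunks; first chunk starts at sample", chunks[0][0])
```

Reads shorter than 11,500 samples yield no chunk under these settings, which matters when the killed cultures produce much shorter reads than the live ones.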
  
  The manuscript is well written and the methods are clearly described (including well documented code and deposited data). The authors' explainability methodology is excellent, although it would have been nice to see a bit more in-depth interpretation of those results. The authors have also presented a convincing case that nanopore signal data does contain information that can be used to distinguish signal chunks from live and dead bacterial monocultures. This method has the potential to be useful in clinical and environmental genomics if it can be extended to more heterogeneous metagenomic samples. However, despite the title and framing of this manuscript (i.e., "metagenomics"), their analyses do not involve any metagenomic data and their results so far do not demonstrate whether this is feasible. Currently, the overall framing (and title) of the manuscript is not appropriate given the work performed at this point. Similarly, given that both E. coli and C. abortus "dead" cultures resulted in a median read length less than half that of the live cultures, the authors do not fully make the case that the signal and ResNet approach is actually required relative to simpler baseline models. Finally, although they did evaluate performance on a completely separate dataset, the authors should at least explore/quantify the correlation of live/dead prediction across chunks of the same read, given the default expectation of non-independence of signal chunks from the same read.
  
  Major - Although the title and framing of the paper suggest that the authors are classifying live and dead bacteria in metagenomic datasets, the actual experiments and method developed are entirely based around sequencing of cultured clonal bacterial isolates. Metagenomic datasets are going to have considerably more heterogeneity in viability, species composition, and DNA signal characteristics. Given this, the paper's title, introduction, and parts of the discussion are a bit of an oversell and inappropriate. This manuscript should be revised to more clearly reflect the work actually performed.
  
  This paper doesn't establish whether a ResNet + Signal approach actually outperforms a much simpler baseline. For example, given that there are clear extraction and median read-length differences between live and dead samples, it is possible that a much simpler logistic model using basic features such as read length and/or translocation could perform equivalently.
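
As a rough illustration of the kind of baseline meant here, the sketch below fits a one-feature logistic model on log read length with plain gradient descent. The read lengths are invented, the helper `fit_logistic_1d` is hypothetical, and accuracy is computed on the training reads only to keep the example short; a real comparison would use held-out reads and the authors' data.

```python
import math


def fit_logistic_1d(xs, ys, lr=0.5, steps=500):
    """Fit p(y=1|x) = sigmoid(w*x + b) by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x / n
            grad_b += (p - y) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical read lengths in bases; label 1 = read from the killed culture.
live_lengths = [8000, 12000, 9500, 15000, 7000, 11000]
dead_lengths = [2000, 3500, 1500, 4000, 2500, 3000]
lengths = live_lengths + dead_lengths
labels = [0] * len(live_lengths) + [1] * len(dead_lengths)

# Single feature: standardized log10 read length.
logs = [math.log10(length) for length in lengths]
mean = sum(logs) / len(logs)
std = math.sqrt(sum((v - mean) ** 2 for v in logs) / len(logs))
features = [(v - mean) / std for v in logs]

w, b = fit_logistic_1d(features, labels)
predictions = [1 if w * x + b >= 0 else 0 for x in features]
# Training-set accuracy only; shown to illustrate the workflow, not performance.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"weight on log length: {w:.2f}, training accuracy: {accuracy:.2f}")
```

If a model like this approaches the ResNet's accuracy on the real data, the signal-level features add little beyond read length; if it falls well short, that would support the authors' approach.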
  
  Although the C. abortus analysis demonstrates limited impact of leakage, I'm still a bit concerned about the potential non-independence of chunks from the same read (i.e., chunk 1 and chunk 3 of the same read are more likely to share similar live/dead signal characteristics than chunk 1 and chunk 3 of different reads). By not having multiple chunks of the same read in the training, validation, or test datasets, the authors may have avoided issues with longer reads being more represented in their datasets. However, this has the potential to introduce data leakage between the train and test sets (which may impact generalisability when they attempt to extend this method to metagenomics). I think this paper would be improved by some exploration of the correlation of live/dead prediction across chunks of the same read. How often do different chunks of the same read disagree? How does this impact the overall performance of the model? Does taking the average prediction across chunks of the same read improve or degrade performance? Would this problem be better suited to a multiple instance learning approach (i.e., a live/dead label applied to all chunks from a single read), especially in more heterogeneous datasets? To what degree do longer reads with more chunks contribute disproportionately to the overall performance in the C. abortus dataset?
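
The questions about chunk disagreement and averaged predictions could be explored with a small aggregation step like the sketch below. The read identifiers and probabilities are hypothetical, and `aggregate_by_read` is an illustrative helper rather than part of the authors' code.

```python
from collections import defaultdict


def aggregate_by_read(chunk_preds, threshold=0.5):
    """Group chunk-level probabilities by read and summarize each read.

    chunk_preds: iterable of (read_id, probability_dead) pairs.
    Returns {read_id: (mean_probability, read_is_dead, chunks_disagree)}.
    """
    by_read = defaultdict(list)
    for read_id, prob in chunk_preds:
        by_read[read_id].append(prob)

    summary = {}
    for read_id, probs in by_read.items():
        mean_prob = sum(probs) / len(probs)
        chunk_labels = {prob >= threshold for prob in probs}
        summary[read_id] = (mean_prob, mean_prob >= threshold, len(chunk_labels) > 1)
    return summary


# Hypothetical chunk-level outputs for three reads.
chunk_preds = [
    ("read_a", 0.9), ("read_a", 0.8), ("read_a", 0.7),
    ("read_b", 0.2), ("read_b", 0.6),
    ("read_c", 0.1),
]
summary = aggregate_by_read(chunk_preds)
disagreement_rate = sum(flag for _, _, flag in summary.values()) / len(summary)
print(f"reads with disagreeing chunks: {disagreement_rate:.2f}")
```

Comparing read-level accuracy from the averaged probabilities against chunk-level accuracy would show whether aggregation helps, and weighting each read once removes the extra influence of long reads with many chunks.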
  
  Minor
  
  SRA records don't seem to be live yet (https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=1123127)
  
  Are the actual pod5 files available?
  
  Read-level performance should be analysed and reported.
  
  Figure 1B: the test subplot numbers are almost too small to read - the subplot may benefit from being its own panel.
  
  Plot axis labels are not always clear (e.g., Figure 3): percentage of what? Chunks or reads? It would be nice to see consistent capitalisation of labels and legends.
  
  Predictions on viable E. coli and viable C. abortus seem surprisingly similar (91.44% vs 91.34% viable and 8.56% vs 8.66% dead) despite different taxa, potentially different underlying viable cell proportions, and different output probability densities. This would benefit from further discussion/analysis - do misclassified chunks have any common characteristics? Would you expect the E. coli to have a similar microscopy/PCR-measured viability percentage to the C. abortus?
  
  It would be good to see a bit more discussion/exploration of the impact of mixed live/dead cells given the ~37.6% viability measure in the C. abortus sample (e.g., how well do models perform with different ratios of live/dead reads?). This could potentially be achieved using in-silico spike-ins.
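
One way to set up such in-silico spike-ins is to sample reads from the live and killed runs at chosen ratios and score the model on each mixture. The sketch below only builds the mixtures; the read identifiers are placeholders, `mix_reads` is a hypothetical helper, and the fractions (including 0.376, taken from the viability figure quoted above) are illustrative choices.

```python
import random


def mix_reads(live_reads, dead_reads, live_fraction, total, seed=0):
    """Sample `total` reads without replacement at a target live fraction."""
    n_live = round(total * live_fraction)
    n_dead = total - n_live
    rng = random.Random(seed)
    mixture = ([(r, "live") for r in rng.sample(live_reads, n_live)]
               + [(r, "dead") for r in rng.sample(dead_reads, n_dead)])
    rng.shuffle(mixture)
    return mixture


# Placeholder read identifiers; real use would draw from the two sequencing runs.
live_reads = [f"live_{i}" for i in range(1000)]
dead_reads = [f"dead_{i}" for i in range(1000)]

mixtures = {frac: mix_reads(live_reads, dead_reads, frac, total=500)
            for frac in (0.1, 0.376, 0.5, 0.9)}
for frac, mixture in mixtures.items():
    n_live = sum(1 for _, label in mixture if label == "live")
    print(f"target live fraction {frac}: {n_live} live of {len(mixture)} reads")
```

Plotting the predicted viable fraction against the known mixed fraction would show how well the model's output tracks the true live/dead ratio.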
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.06.10.598221v2
www.biorxiv.org

The Open Pediatric Cancer Project

2
1. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Abstract
  Background: In 2019, the Open Pediatric Brain Tumor Atlas (OpenPBTA) was created as a global, collaborative open-science initiative to genomically characterize 1,074 pediatric brain tumors and 22 patient-derived cell lines. Here, we present an extension of the OpenPBTA called the Open Pediatric Cancer (OpenPedCan) Project, a harmonized open-source multi-omic dataset from 6,112 pediatric cancer patients with 7,096 tumor events across more than 100 histologies. Combined with RNA-Seq from the Genotype-Tissue Expression (GTEx) and The Cancer Genome Atlas (TCGA), OpenPedCan contains nearly 48,000 total biospecimens (24,002 tumor and 23,893 normal specimens).
  Findings: We utilized Gabriella Miller Kids First (GMKF) workflows to harmonize WGS, WXS, RNA-seq, and Targeted Sequencing datasets to include somatic SNVs, InDels, CNVs, SVs, RNA expression, fusions, and splice variants. We integrated summarized CPTAC whole cell proteomics and phospho-proteomics data, miRNA-Seq data, and have developed a methylation array harmonization workflow to include m-values, beta-values, and copy number calls. OpenPedCan contains reproducible, dockerized workflows in GitHub, CAVATICA, and Amazon Web Services (AWS) to deliver harmonized and processed data from over 60 scalable modules which can be leveraged both locally and on AWS. The processed data are released in a versioned manner and accessible through CAVATICA or AWS S3 download (from GitHub), and queryable through PedcBioPortal and the NCI’s pediatric Molecular Targets Platform. Notably, we have expanded PBTA molecular subtyping to include methylation information to align with the WHO 2021 Central Nervous System Tumor classifications, allowing us to create research-grade integrated diagnoses for these tumors.
  Conclusions: OpenPedCan data and its reproducible analysis module framework are openly available and can be utilized and/or adapted by researchers to accelerate discovery, validation, and clinical translation.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf093), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Jacek Majewski
  
  Shapiro et al. describe the Open Pediatric Cancer Project, a dataset, web portals, and a GitHub repository to facilitate data access and analysis and to encourage collaborations using pediatric cancer omics data. While the concept is inspired, it does not constitute a significant advance over the previously described OpenPBTA project. The goal of the manuscript may be to provide a pointer to the updated datasets and web resources, but this does not seem like a sufficient reason to publish. As far as I can tell, all of the information in the manuscript is already provided on the OpenPedCan Bioportal (which is really useful, to be fair) and on GitHub. To publish a manuscript just as a pointer to that information does not seem justifiable in my opinion.
  
  Major Concerns:
  
  Novelty and Validity of Key Features:
  
  The manuscript highlights several key features of OpenPedCan, including data harmonization, multi-omic integration, reproducibility, scalability, versioned data releases, accessibility, alignment with WHO 2021 classifications, and the open-source framework. However, these features are not novel. Many of them represent standard practices in the field. Moreover, some claims appear questionable:
  * Reproducibility: While the authors claim reproducibility, using OpenPedCan's dockerized workflows would require significant computational resources (e.g., 98GB of CPU) or expensive cloud services (e.g., AWS).
  * Accessibility: The platform's interface requires users to have a Gmail account, limiting its accessibility. Alternative login options should be considered.
  * Open-Source Framework: The manuscript does not adequately address how the framework handles access to controlled data, such as those integrated from external sources like TARGET and TCGA, which may require restricted access permissions.
  
  Lack of Novel Methodologies and Findings:
  
  While OpenPedCan integrates data from existing workflows and portals (e.g., Gabriella Miller Kids First, TCGA), the manuscript does not clearly outline novel methodologies or scientific contributions. Most prominently, the submission appears to be an incremental extension of the previous manuscript describing OpenPBTA published in Cell Genomics 2023. The only potentially novel components appear to be proteomics and molecular subtyping based on methylation, but no specific examples or case studies demonstrating the novelty or impact of these contributions are provided.
  
  Redundancy with Existing Tools:
  
  The manuscript states that OpenPedCan serves as a community resource for addressing research questions and providing orthogonal validation datasets. However, there is nothing presented in OpenPedCan that cannot already be achieved with existing tools. This makes the claim somewhat redundant, as the platform largely serves as a data integrator rather than offering unique capabilities.
  
  Minor Concerns:
  
  Splicing Analysis Module:
  
  The manuscript refers to a splicing analysis module (Figure 2: OpenPedCan Analysis Workflow), but there is no further description or discussion of this module within the text. Further elaboration is needed.
  
  Incomplete Module Descriptions:
  
  The manuscript describes several analysis modules but should describe them more comprehensively, especially the Splicing Analysis module.
  
  Additionally, the Molecular Subtyping component, based on molecular and methylation data, is the only module with a clear methodological explanation.
  
  Further clarification on the methods used in other modules would be beneficial.
2. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Reviewer 1: Stephen R Piccolo
  
  I love this type of work. This research will be invaluable to the wider research community of people studying pediatric cancers. It will save lots of time and frustration and move the field forward. The paper is well written. I have to admit that I am not well versed in all of the latest software tools and settings to use for processing all of the data types that the repository includes. So I cannot vouch for or against those. However, the tools that I am familiar with seem reasonable. I have a few comments / suggestions / questions.
  
  How is patient privacy maintained? Sorry if I missed this. The paper mentions the original sources of the data. However, if I understand correctly, OpenPBTA has reprocessed versions of the data. What processes are used to regulate access to versions of the data that must be kept secure? Perhaps I am misunderstanding the ideas behind how this works.
  
  Validation. It would be helpful if the paper could touch on the approach the authors use to ensure that data that they have (re)processed are valid. For example, are there any known findings that show up after the data have been reprocessed? Or are there other ways of assessing quality?
  
  The paper mentions TCGA and GTEx. It also mentions that adult data are included. But I didn't see a clear rationale for doing this.
  
  The paper includes many links, some of which reference portions of the GitHub site. It would be best to display the URLs in the paper itself. It would also be useful to reference a Zenodo-archived version of the GitHub site so that there is a versioned record of the repository at the time of submission.
  
  Supplementary Table 1 has a tab with information about the patient metadata ("Biospecimen-level metadata and clinical data"). However, I didn't see details in the paper about how these were harmonized. How did the authors ensure that metadata values coming from disparate sources were used consistently? What expertise did they have? How did they resolve inconsistencies or missing data? Supplementary Table 1 indicates a definition and a data type for each of these fields. It would be much more useful to provide ontology term(s) for each of these fields so that the metadata were machine readable.
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.07.09.599086v3
www.biorxiv.org

A comprehensive water buffalo pangenome reveals extensive structural variation linked to population specific signatures of selection

4
1. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Abstract
  Water buffalo is a cornerstone livestock species in many low- and middle-income countries, yet major gaps persist in its genomic characterization, complicated by the divergent karyotypes of its two sub-species (swamp and river). Such genomic complexity makes water buffalo a particularly good candidate for the use of graph genomics, which can capture variation missed by linear reference approaches. However, the utility of this approach to improve water buffalo has been largely unexplored. We present a comprehensive pangenome that integrates four newly generated, highly contiguous assemblies of Pakistani river buffalo with available assemblies from both sub-species. This doubles the number of accessible high-quality river buffalo genomes and provides the most contiguous assemblies for the sub-species to date. Using the pangenome to assay variation across 711 global samples, we uncovered extensive genomic diversity, including thousands of large structural variants absent from the reference genome, spanning over 140 Mb of additional sequence. We demonstrate the utility of these data by identifying putative functional indels and structural variants linked to selective sweeps in key genes involved in productivity and immune response across 26 populations. This study represents one of the first successful applications of graph genomics in water buffalo and offers valuable insights into how integrating assemblies can transform analyses of water buffalo and other species with complex evolutionary histories. We anticipate that these assemblies, and the pangenome and putative functional structural variants we have released, will accelerate efforts to unlock water buffalo’s genetic potential, improving productivity and resilience in this economically important species.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf099), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 4: Wai Yee Low
  
  Review of "A comprehensive water buffalo pangenome reveals extensive structural variation linked to population specific signatures of selection". This is an impressive work at the frontier of buffalo genomics. I truly enjoyed reading the work, and my questions/comments are aimed at improving it further. My detailed comments are below:
  Line 30: I think it would be better to include the actual number of publicly available assemblies used to create the pangenome graph.
  Line 71: There is now a swamp buffalo reference genome with annotation too (NCBI accession: PCC_UOA_SB_1v2). Perhaps consider citing the swamp buffalo reference https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giae053/7753516 and rewriting the sentence to say a pangenome can be used for both swamp and river, but a single linear reference from either subspecies for read mapping is not good enough.
  Line 79: "highlighted"
  Line 82: What do you mean by "higher quality"? The assemblies have been discussed in this review: https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2021.629861/full
  Line 105: Technically, the graph method for bovine species, which includes water buffalo, is being investigated by the Bovine Pangenome Consortium (BPC). However, nothing useful has been published on the buffalo graph, but perhaps consider citing the BPC since your paper overlaps with it (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02975-0).
  Line 165: It would be good to add a bit more context on the PanGenie method here, as researchers in the buffalo community are not used to it. Additionally, it would be great if all code were made available on GitHub or as Supplementary Info.
  Line 170: To produce a phased pangenome graph, don't you need all input assemblies to be phased? Are all input assemblies phased? The UOA_WB_1 is locally phased, not phased throughout the genome.
  Line 235: "a list of 403 unrelated individuals." What does this translate to in terms that geneticists can understand? Do you mean siblings have been removed? Or individuals sharing the same grandparents were removed?
  Line 246: Can you please explain how you got the coordinates to match between the GATK and PanGenie methods? You'll need matching coordinates for concordance analysis. As I understand it, the GATK calls were based on UOA_WB_1?
  Line 254: Why these 3 chromosomes?
  Line 257: If you had not filtered for relatedness, how would it impact the selective sweep work? I think including some context will help the readers.
  Line 259: Do you mean at least six samples per group? If yes, are 6 samples enough?
  Line 261: Genotype quality less than 25 according to bcftools? Since you only used biallelic variants, please provide the breakdown between biallelic and multiallelic.
  Line 281: "… we first PacBio HiFi sequenced one female" Please rewrite this.
  Line 282: How common are these two breeds in percentage?
  Line 291: Is this already known? Perhaps cite the literature to show the agreement with previous studies?
  Fig 1D: This is a bit too small to see, especially the SV distribution at the bottom. I can hardly see the median.
  Line 310: Why did you choose UOA_WB_1 as the reference?
  Line 311: Do the ~32.8 million variants include SNPs as well?
  Fig 2: This is probably a panel of a figure but should not be the entire figure. The size of the circle indicates sample size, but there should be a legend on the plot stating the sizes, right? Should a darker colour be used to highlight the countries with samples instead of white? Maybe this could be a Supp figure too.
  Line 356: Should S Figures 4 and 5 be main figures? You will need to annotate the abbreviation of sample-country in the legend of S Figure 5.
  Line 360: "To enable reuse we have made this dataset available …" Should the dataset be made available to reviewers?
  Line 368: "76% of SNVs were called by both callers" 76% seems low. Also, called does not mean concordant. What is the concordance among SNVs called in both? Did the pangenome approach call most of the variants found by GATK? If not, what might be the reasons?
  Fig 3B: It is not immediately clear what the difference is between non-repetitive and repetitive regions. The overlapping text on the x-axes makes it hard to read.
  Line 390: "Analyses such as the study of selective sweeps or genome-wide association studies where low frequency variants are often filtered out will benefit less from the advantages of GATK, particularly given its longer run time." From here on, this paragraph is Discussion, not Results.
  Line 418: Why human? Could you use cattle?
  Line 427: I tried the browser and am not sure what I can learn from it. It would be helpful if there were a README with some examples of what can be explored.
  Line 450: How large must a variant be before you consider it a larger variant? Does this ability to study larger variants still hold despite using only ~10 assemblies in the graph? Will the use of short reads for the selective sweep study still benefit from being able to incorporate these larger variants? As I understand it, the larger variants were found only from the graph, not from the short reads. As such, the selective sweep may not be associated with any larger variants?
  Line 470: Should Fig S8 be a main figure?
  Line 513: Instead of the UniProt link, perhaps consider including this as Supplementary info or text. The info in the link may change in the future.
  Line 551: However, without scaffolding, the assemblies of Pakistani river buffalo may not be good enough to function as reference genomes for river buffalo?
  Line 552: When considering new bases, did you do this for each assembly independently, or were the new bases discovered cumulatively?
  Line 581: Some of my questions at Line 450 can be discussed here.
  Line 586: Perhaps consider discussing the limitations of the small number of assemblies used to create the graph. As such, many SVs are likely still missing and we are still unable to properly assess the allele frequency of these larger SVs. Additionally, while some SVs may not be considered large in this work, that does not mean they have no impact.
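
The Line 368 distinction between sites called by both callers and genotype concordance at those sites can be illustrated with a short sketch. The variant keys and genotypes are invented, `site_overlap_and_concordance` is a hypothetical helper, and the overlap is computed here over the union of sites, which may differ from how the manuscript defines its 76% figure.

```python
def site_overlap_and_concordance(calls_a, calls_b):
    """Compare two call sets keyed by (chrom, pos, ref, alt) -> genotype.

    Returns (fraction of all sites called by both callers,
             fraction of shared sites with identical genotypes).
    """
    shared = calls_a.keys() & calls_b.keys()
    union = calls_a.keys() | calls_b.keys()
    matching = sum(1 for site in shared if calls_a[site] == calls_b[site])
    overlap = len(shared) / len(union) if union else 0.0
    concordance = matching / len(shared) if shared else 0.0
    return overlap, concordance


# Hypothetical genotypes for one sample from two callers.
gatk_calls = {
    ("1", 100, "A", "G"): "0/1",
    ("1", 250, "C", "T"): "1/1",
    ("1", 400, "G", "A"): "0/1",
    ("2", 50, "T", "C"): "0/1",
}
pangenie_calls = {
    ("1", 100, "A", "G"): "0/1",
    ("1", 250, "C", "T"): "0/1",
    ("1", 400, "G", "A"): "0/1",
    ("2", 900, "A", "C"): "1/1",
}
overlap, concordance = site_overlap_and_concordance(gatk_calls, pangenie_calls)
print(f"sites called by both: {overlap:.2f}, genotype concordance: {concordance:.2f}")
```

In this toy case three of five sites are shared, but only two of those three carry the same genotype, so the two numbers answer different questions. Both call sets must be on the same coordinate system and normalized the same way for the keys to match, which ties back to the Line 246 comment.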
2. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Reviewer 3: Laura Caquelin
  
  Summary of the Study: This study used graph genomics to better characterize water buffalo genomes. By building a pangenome from new and existing assemblies, the authors analyzed 711 samples. These samples revealed structural variation. These results highlight the value of graph genomics. This method
  
  Scope of reproducibility: According to our assessment, the primary objective is to identify genomic variants within selective sweep regions in the water buffalo genome.
  
  Outcome: Enrichment of high-impact structural variants (SVs), insertions/deletions (indels) and single nucleotide variants (SNVs) in selective sweep regions.
  
  Analysis method outcome: Variants were compared between selective sweep regions and genome-wide. Fisher's exact test was used to assess enrichment of functional variants.
  
  Main result: "Prior to annotation, multiallelic variants were normalized by splitting them into separate biallelic entries, resulting in 6,159,686 indels, 28,669,966 SNVs, and 160,921 SVs entries. Within putative selective sweep regions we identified 208,862 indels, 997,500 SNVs and 6,748 SVs. Notably an enrichment of HIGH impact SVs, indels and SNVs were observed within selective sweep regions (Figure 5A, Supplementary Table S6), with 50-80% more variants in these areas having a HIGH impact compared to genome-wide. Among the high impact variants in selective sweep regions only 20% were SNVs, with the remainder being SVs and indels, suggesting high impact larger variants may underlie putative selective sweeps." (Lines 453 to 461)
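
  The enrichment test named in the analysis method can be illustrated with a minimal Python sketch of a two-sided Fisher's exact test on a 2x2 table. This is not the authors' script (which is in R); the function name `fisher_exact_two_sided` and the toy counts are hypothetical, and the layout of the table (HIGH-impact versus other variants, inside versus outside sweep regions) is an assumption about how the comparison could be framed.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical): rows are HIGH-impact / other variants,
    columns are inside / outside selective sweep regions.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of every table that is no more likely than
    # the observed one (small relative slack guards against float ties).
    total = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)


# Toy counts chosen only for illustration; they are not from the study.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"two-sided p = {p_value:.4f}")
```

  Exact integer binomials keep the sketch dependency-free, but for counts in the millions (as reported for SNVs) a statistics library would be the more practical choice.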
  
  Availability of Materials

  a. Data
  
  Data availability: Open
  
  Data completeness: Complete, all data necessary to reproduce main results are available
  
  Access Method: Supplementary files - Repository: -
  
  Data quality: Structured

  b. Code
  
  Code availability: Shared for the review after request - Programming Language(s): R
  
  Repository link: -
  
  License: -
  
  Repository status: -
  
  Documentation: No documentation
  
  Computational environment of reproduction analysis
  
  Operating system for reproduction: MacOS 14.7.4
  
  Programming Language(s): R
  
  Code implementation approach: Creating script according to the methodology description/Using shared code
  
  Version environment for reproduction: R version 4.4.1/RStudio 2024.09.0
  
  Results

  5.1 Original study results

  Results 1: Results are presented in Figure 5A.

  5.2 Steps for reproduction -> Reproduce the results

  The code was not shared initially, but as the data were provided and the test was a Fisher's exact test, I wrote code to reproduce the p-values.
  
  Issue 1: P-values for the SNV variant type and for the « Modifier » impact class were not provided. -- Resolved: The authors provided an updated Supplementary Table S6 with exact numerical p-values for each variant type and each impact class. The code "variantEnrichAtPeaks.R" used to generate Figure 5A and Supplementary Table S6 was also shared. New version of Supplementary Table S6: (see screenshot)
  
  The comparison between the reproduced results and the original results was then performed using the shared code. (Notably, the R script I wrote produced the same p-value as the one presented in Figure 5A.)
  
  Issue 2: The script "variantEnrichAtPeaks.R" generated only the figures, not the new Supplementary Table S6 with the numerical p-values. -- Resolved: Some lines of code were added to the function "makePlot" to generate this table in addition to the figure.
  
  Lines 159 to 178 of the script "variantEnrichAtPeaks_RCC."
  
  Supplementary table S6 (added):
  
  # Build the summary table: per-impact proportions, their ratio, and Fisher p-values
  summary_table <- df %>%
    mutate(
      Type = variantType,
      Genome_Wide_Prop = Genome_wide / sum(Genome_wide),
      Selective_Sweep_peaks_Prop = Sweep / sum(Sweep),
      Ratio_of_proportions = Selective_Sweep_peaks_Prop / Genome_Wide_Prop
    ) %>%
    left_join(pval_df, by = "Impact") %>%
    # Column names containing spaces must be wrapped in backticks
    select(
      Impact, Type,
      Genome_Wide = Genome_wide,
      `Selective_Sweep peaks` = Sweep,
      `Genome_Wide Prop` = Genome_Wide_Prop,
      `Selective_Sweep peaks Prop` = Selective_Sweep_peaks_Prop,
      `Ratio of proportions` = Ratio_of_proportions,
      `Fishers exact P` = p_value
    )
  
  # Return the table alongside the plot
  return(list(plot = p, summary_table = summary_table))
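
  For readers who do not use R, the proportion and ratio columns built above can be sketched in plain Python. This is an illustrative re-expression, not the shared script: the function name `summarize_impacts` and the counts are hypothetical, and the column names simply mirror those in the R snippet.

```python
def summarize_impacts(counts):
    """Compute per-impact proportions and their ratio.

    counts maps an impact class to (genome_wide_count, sweep_count).
    The values passed in below are made up for illustration.
    """
    genome_total = sum(gw for gw, _ in counts.values())
    sweep_total = sum(sw for _, sw in counts.values())
    rows = []
    for impact, (gw, sw) in counts.items():
        gw_prop = gw / genome_total
        sw_prop = sw / sweep_total
        rows.append(
            {
                "Impact": impact,
                "Genome_Wide": gw,
                "Selective_Sweep_peaks": sw,
                "Genome_Wide_Prop": gw_prop,
                "Selective_Sweep_peaks_Prop": sw_prop,
                # A ratio above 1 means the class is over-represented in sweeps
                "Ratio_of_proportions": sw_prop / gw_prop,
            }
        )
    return rows


# Hypothetical counts, not taken from Supplementary Table S6.
example = {"HIGH": (100, 15), "MODERATE": (300, 25), "LOW": (600, 60)}
table = summarize_impacts(example)
for row in table:
    print(row["Impact"], round(row["Ratio_of_proportions"], 3))
```

  With these toy numbers the HIGH class has a ratio of about 1.5, which is how a statement such as "50% more variants having a HIGH impact" would appear in the table.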
  
  5.3 Statistical comparison Original vs Reproduced results

  Results: Figure and table S6 were reproduced for each variant type and impact:
  -- SVs type: (see screenshot)
  -- Indels type: (see screenshot)
  -- And SNVs type: (see screenshot)
  
  Comments: The shared code was used to compute the p-values and generate the figures. Minor numerical discrepancies were observed for some p-values, likely due to rounding differences. The p-values in the original Excel file appear to be stored with lower decimal precision than those computed in R. This difference is negligible and does not indicate a reproducibility issue.
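
  One way to make "negligible rounding differences" checkable is to compare original and reproduced p-values under an explicit relative tolerance. The sketch below is an assumption about how such a check could be written; the function name `compare_p_values`, the tolerance, and all values are hypothetical and do not come from the review.

```python
import math


def compare_p_values(original, reproduced, rel_tol=5e-3):
    """Flag, per key, whether two p-values agree within a relative tolerance.

    rel_tol=5e-3 roughly corresponds to agreement at three significant
    figures; the threshold is an illustrative choice, not a standard.
    """
    return {
        key: math.isclose(p_orig, reproduced[key], rel_tol=rel_tol)
        for key, p_orig in original.items()
    }


# Made-up values: the first pair differs only by rounding, the second does not.
original = {"HIGH": 0.0123, "LOW": 0.05}
reproduced = {"HIGH": 0.01234567, "LOW": 0.04}
agreement = compare_p_values(original, reproduced)
print(agreement)
```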
  
  Errors detected: No error detected.
  
  Statistical Consistency: The results were successfully reproduced with the shared code.
  
  Conclusion
  
  Summary of the computational reproducibility review: The Fisher's exact tests for enrichment across variant and impact categories, presented in Figure 5A of the manuscript, were successfully reproduced using the data in supplementary table S6 and the shared code. Results were consistent with the original, with only negligible rounding differences in p-values.
  
  Recommendations for authors: We were able to reproduce the study with the data and information provided in the Figure 5A description. To further improve transparency and ensure full reproducibility of your manuscript, the following recommendations are suggested:
  -- Make the code needed to reproduce all analyses in the paper openly available so that anyone can reproduce the results. Ideally, provide a README or requirements.txt file describing how to run the analysis, including software versions, packages, and dependencies.
  -- Include statistical outputs, such as exact p-values, in supplementary materials when possible. This ensures clarity and eases verification. Ideally, provide metadata: for the datasets used or generated by the scripts, it would be helpful to include accompanying metadata files that explain:
  --- The definition of each variable name.
  --- The origin of each dataset (raw, processed, etc).
  --- Any preprocessing steps applied before analysis.
3. GigaScience 30 Sep 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf099), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Yi Zhang
  
  This manuscript presents the first high-quality, haplotype-resolved genome assemblies for two representative Pakistani river buffalo breeds (Nili Ravi and Azikheli), integrating them with existing assemblies to construct a water buffalo pangenome. The study leverages graph genomics to characterize structural variation (SV), identifying >140 Mb of non-reference sequence and 111,352 SVs. By genotyping 711 global samples against this pangenome, the authors uncover population-specific selective sweeps linked to productivity, immunity, and adaptation traits, revealing potentially functional SVs, though these findings are limited by the absence of validation evidence and cross-study comparisons. The work highlights graph genomics as a transformative tool for integrative analyses of evolutionarily related species in an unbiased way and provides resources to accelerate buffalo breeding.
  
  General Comments
  1. The study's methodology is rigorous, combining long-read assembly, graph-based genotyping (PanGenie), and population-level sweep scans. Nevertheless, the manuscript would benefit from discussion of graph limitations, such as bias against rare variants (Fig. 3B) and challenges in graph construction for species with karyotypic divergence.
  2. The selection signature analyses were done across a number of population groups, but the paper only showcases a limited selection of results. To strengthen the manuscript, the authors could concentrate on a consistent set of populations. This would enable a more in-depth examination of selective signals common across buffalo population groups or unique selective signals specific to certain groups.
  3. It could be informative to conduct comparative analyses of selection signatures using variant datasets from PanGenie and GATK. This could reveal whether the pangenome approach might uncover important structural variants within selection signals that GATK fails to identify.
  
  Specific Comments
  1. In Figure 1D and the main text, the rationale behind dividing the SVs into 40 sets is not clearly presented. If the interpretation is correct, the y-axis label of the bar graph should denote the number of SVs rather than size. Moreover, the main title "SVs Size Distribution" at the top seems more relevant to the box plots at the bottom.
  2. Lines 325 - 326 state that the newly assembled pangenome graph exhibits a substantial increase in genome size compared to the existing reference genome. It is recommended that the authors describe the distribution of the 147,865,364 bp across the entire genome. Are they more prevalent in specific regions of certain chromosomes?
  3. In lines 410 - 412, there may be an issue with the citation of Table S2. The table contains 402 individuals, whereas the text mentions 282.
  4. Figure 3 shows that, when using 30x samples in the variant calling comparison between PanGenie and GATK, there are still a large number of SNVs detectable only by GATK. A more in-depth technical discussion of these differences would greatly enhance the reader's comprehension of these findings and the relative performance of the two methods.
  5. To provide a more intuitive understanding of how SVs can influence gene function and contribute to the traits, the authors could include a figure that displays an example gene structure along with the SV of interest within a selection signal peak.
4. GigaScience 30 Sep 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf099), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Paul Stothard
  
  This well-written manuscript describes the generation of new genome assemblies for water buffalo and the construction of a pangenome graph that is used for variant calling and downstream analyses. The work is clearly described and the methods are appropriate given the goals of the study. The results are interesting and timely, and realistic limitations are stated. The manuscript should be of high interest to the water buffalo research community and to those interested in applying pangenome graphs to variant calling.
  
  I have minor comments that I believe should be addressed prior to publication.
  
  Minor comments:
  
  In the NCBI genomes database, the water buffalo assembly NDDB_SH_1 is listed as the current reference genome, not UOA_WB_1 as suggested in the manuscript. Perhaps the reference genome was recently reassigned?
  
  Lines 64-69: Lack of clarity regarding relationships among water buffalo populations:
  - The wording suggests that a single domestication event accounts for all domestic water buffalo. However, the river and swamp buffalo diverged prior to the domestication date, which is a contradiction. Clarify by mentioning that there were at least two independent domestication events (one for river buffalo and one for swamp buffalo).
  - Taxonomic terminology is inherently ambiguous for a few reasons, including: 1) The Bubalus arnee species comprises both wild river buffalo and wild swamp buffalo, which have not been assigned subspecies names. 2) Domestic water buffalo (including river and swamp buffalo) are assigned their own species name, Bubalus bubalis, despite being biologically the same species as Bubalus arnee. 3) Unlike their wild source populations, domesticated river buffalo and domesticated swamp buffalo are assigned their own subspecies names, Bubalus bubalis bubalis and Bubalus bubalis carabanensis, respectively.
  - To address ambiguity regarding taxonomy and phylogeny of the buffalo populations, mention the full subspecies names (Bubalus bubalis bubalis, and Bubalus bubalis carabanensis).
  
  Line 82: "Although eight higher quality": higher quality than what?
  
  Line 177: Undefined acronym: "PAF".
  
  Line 216: "each unique biosamples": should be "each unique biosample".
  
  Line 272: Which SnpEff database was used for variant annotation?
  
  Lines 286-287: Based on Table 1, the difference between the largest and the smallest water buffalo genome is 360 megabase pairs. That exceeds the length of the largest chromosome by almost 2-fold, and is 14% of the total length of the UOA_WB_1 reference assembly. This is a very large difference to observe between members of the same species. Considering that segmental duplications are often not accurately represented in genome assemblies, there is a strong possibility that some of the variants identified between these new high-quality assemblies and the other assemblies are simply assembly artefacts (failure of recently duplicated segments to be distinguished, etc.). At the very least, this should be addressed in the Discussion.
  
  Lines 360-361: Elaborate slightly on what is in the dataset being shared.
  
  Lines 420-421: Clarify which of these are human vs animal traits.
  
  Figure 1 A legend: The dots seem to all be the same size, which suggests that this is a scatter plot, not a bubble plot.
  
  Figure 1 C: "across the graph genome" sounds spatial; perhaps "proportion of variant types in the graph genome" would be clearer.
  
  Figure 1 D: It would be helpful to have the rows sorted to match the order in B.
  
  Figure 1 D: The low bars (i.e. small number of shared sites) are not easy to interpret. Perhaps the y-axis could be transformed to log scale or the number of variants could be added to the bars.
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.05.04.652079v1
www.biorxiv.org www.biorxiv.org

Comparing Linear and Nonlinear Finite Element Models of Vertebral Strength Across the Thoracolumbar Spine: A Benchmark from Density-Calibrated Computed Tomography

2
1. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Abstract: Opportunistic assessment of vertebral strength from clinical computed tomography (CT) scans holds substantial promise for fracture risk stratification, yet variability in calibration methods and finite element (FE) modeling approaches has led to limited comparability across studies. In this work, we provide a publicly available benchmark dataset that supports standardized biomechanical analysis of the thoracic and lumbar spine using density-calibrated CT data. We extended the VerSe 2019 dataset to include phantomless quantitative CT calibration, automated vertebral substructure segmentation, and vertebral strength estimates derived from both linear and nonlinear FE models. The cohort comprises 141 patients scanned across five CT systems, including contrast-enhanced protocols. Phantomless calibration was performed using automatically segmented tissue references and validated against synchronous calibration phantoms in 17 scans. To evaluate model performance, we implemented a nonlinear elastoplastic FE model and compared it to two linear estimates. A displacement-calibrated linear model (0.2% axial strain) demonstrated excellent agreement with nonlinear failure loads (R = 0.96; mean difference = -0.07 kN), while a stiffness-based approach showed similarly strong correlation (R = 0.92). We evaluated vertebral strength at all thoracic and lumbar levels, enabling level-wise normalization and comparison. Strength ratios revealed consistent anatomical trends and identified T12 and T9 as reliable alternatives to L1 for opportunistic screening and model standardization. All calibrated scans, segmentations, software, and modeling outputs are publicly released, providing a benchmark resource for validation and development of FE models, radiomics tools, and other quantitative imaging applications in musculoskeletal research.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf094), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Karan Devane
  
  The study uses an open-source dataset collected in a population representative of those who would benefit from opportunistic screening and included physiological variation (i.e. contrast enhanced images and pre-existing fracture), alongside validation of density and FE assessment calibration methods. The methods are described in detail, including software versioning schemes, and links to the software sources as relevant for use in replicating methods. Additionally, the enhanced dataset is being included alongside the publication. The primary purpose of this study was to prepare and make available a public dataset for use in continued testing and development of opportunistic screening methods. The data appears to be conservatively analyzed as such, and the authors make notes of existing limitations of the population and sample characteristics where applicable. Additionally, the phantomless calibration technique is validated within this dataset prior to use in support of the "generalizability of the approach" (178), though the applied sample for this is relatively small (n=17 with in-scan phantoms). The manuscript is well-written and easy to understand but I have a few suggestions and comments that need to be addressed.
  
  The data are well-controlled for the study cohort; however, as mentioned by the authors (228-232), this cohort is biased towards individuals with pre-existing skeletal fragility, as indicated by the average lumbar T-score as assessed by DXA falling in the osteopenic range (-1.5, Table 1). Beyond this, the authors made use of multiple validated calibration techniques to support the use of their internal calibration scheme, as well as analysis of potential confounding variables such as contrast-enhanced CT scans. Relative vertebral strength analysis (Figure 6, Table 2), however, does not appear to be analyzed with respect to the fractures mentioned as present throughout the cohort (193). While differences in strength may be primarily explained by density or size, it is possible that the incidence of pre-existing fracture occurring in the thoracolumbar segment may influence adaptation of the other vertebrae in the region [1][2][3], and as such analysis for fracture inclusion may be warranted.
  
  The use of standardized FE modeling techniques supports the goal for reproducibility of assessment in clinical FE modeling. While the authors made efforts to enhance the reproducibility and generalizability of the dataset, they themselves note that the source population is not necessarily descriptive of a general population (lines 227-232). Though this population is representative of those indicated for opportunistic screening, the development of risk curves necessitates the inclusion of healthy individuals and follow-up analysis to fully flesh out the use of opportunistic FE in clinical settings; however, this analysis would require a much larger cohort and is outside the scope of the current manuscript. Further, while 'voxel-models' are typically regarded as standard, tetrahedral element models may generally provide better representation of complex biological geometries [4]. All approaches to FE have drawbacks, and tetrahedral models may be less optimal than hexahedral elements with respect to convergence and the possibility of artificial stiffening. However, the high prevalence of osteophytes and degradation [5], particularly in older populations where screening is indicated, may warrant the use of tetrahedral elements, which capture the intricacies of vertebral geometry that impact FE-derived strength [6]. While again potentially outside the scope of this study, it might be noted as an additional formulative variable for FE approaches to estimating fracture risk.
  
  Line 269 -> "… applications such as radiomics-driven [approach?] for opportunistic …"
  
  As fracture prevalence is included in the dataset, it may be worthwhile to include analysis of fracture-adjacent vertebrae in the selection of a surrogate vertebra for L1 in opportunistic screening. Does pre-existing fracture influence which vertebrae are selected, and should this decision be made on a person-to-person basis, taking into consideration the particular condition of the vertebrae available in the scan?
  
  [1] https://pmc.ncbi.nlm.nih.gov/articles/PMC8752702/
  [2] https://academic.oup.com/jbmr/article/39/12/1744/7825427
  [3] https://pmc.ncbi.nlm.nih.gov/articles/PMC7697376/
  [4] https://www.sciencedirect.com/science/article/pii/S0021929005003568
  [5] https://link.springer.com/article/10.1007/s12565-010-0080-8
  [6] https://www.sciencedirect.com/science/article/pii/S1529943018306466
2. GigaScience 30 Sep 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf094), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Maria Prado
  
  The study presents a novel technique that could advance vertebral strength estimations using FE analysis. The authors clearly articulate the motivation for open benchmarking, covering spinal regions (T1-L6) that are not typically included in similar studies. The description and availability of both linear and nonlinear models support the method's broad utility. I value the authors' effort to share data and open-source resources, which enhances reproducibility.
  
  The following suggestions are intended to enhance the manuscript and clarify or expand some sections for future readers.
  
  (Lines 122-132) The justification for choosing 0.2% axial strain as the calibration threshold is somewhat empirical and based on only three representative samples (low, medium, and high vBMD). Please expand on how representative these three samples are of the entire cohort and whether additional samples were tested to confirm generalizability.
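
  To make concrete what a 0.2% axial strain threshold implies for a linear model, the sketch below converts strain to displacement and then to a load estimate. It assumes the linear model can be summarized by a single axial stiffness, which is a simplification and not the manuscript's FE implementation; the function name `linear_load_at_strain` and the input values are hypothetical.

```python
def linear_load_at_strain(stiffness_n_per_mm, height_mm, axial_strain=0.002):
    """Load (N) a linear-elastic model carries at a prescribed axial strain.

    Assumption: the vertebra behaves as a linear spring of stiffness K, so
    displacement = strain * height and load = K * displacement.
    """
    displacement_mm = axial_strain * height_mm
    return stiffness_n_per_mm * displacement_mm


# Hypothetical stiffness and height, chosen only to show the arithmetic.
load_n = linear_load_at_strain(stiffness_n_per_mm=30000.0, height_mm=25.0)
print(f"estimated load: {load_n / 1000:.2f} kN")
```

  Because the estimate scales linearly with the chosen strain, any shift in the calibration threshold shifts every predicted load by the same factor, which is why the representativeness of the calibration samples matters.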
  
  (Lines 151-152) The manuscript notes that T12 (+2.2%) and T9 (-2.1%) exhibited the smallest deviation from L1, suggesting their potential as alternative targets. In addition to calculating these deviations, was any further analysis performed to support this conclusion? Consider expanding on whether more extensive validation or simulations would be necessary to robustly support T12 and T9 as substitutes for L1.
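
  A level-wise deviation of this kind could be computed as the mean within-patient strength ratio to L1, expressed as a percentage. The sketch below is only one plausible reading; the manuscript's own normalization (described there as a graph model) may differ, and the function name `percent_deviation_from_reference` and the strengths are hypothetical.

```python
from statistics import mean


def percent_deviation_from_reference(patients, reference="L1"):
    """Mean percent deviation of each level's strength from a reference level.

    patients is a list of dicts mapping vertebral level -> strength (kN).
    Patients missing the reference level are skipped.
    """
    ratios = {}
    for strengths in patients:
        ref_strength = strengths.get(reference)
        if not ref_strength:
            continue
        for level, strength in strengths.items():
            ratios.setdefault(level, []).append(strength / ref_strength)
    return {level: (mean(vals) - 1.0) * 100.0 for level, vals in ratios.items()}


# Made-up strengths for two patients; not values from the dataset.
cohort = [
    {"T9": 4.9, "T12": 5.1, "L1": 5.0},
    {"T9": 3.9, "T12": 4.1, "L1": 4.0},
]
deviations = percent_deviation_from_reference(cohort)
print({level: round(dev, 2) for level, dev in deviations.items()})
```

  A small mean deviation does not by itself show that a level is interchangeable with L1 for an individual; the spread of the per-patient ratios would also need to be reported.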
  
  (Lines 198-200) The description of cortical bone modeling is vague. It is not clear whether cortical bone was modeled explicitly or only accounted for implicitly. Clarification would be appreciated. Additionally, please comment on whether the method leads to under- or overestimation of strength in areas where cortical bone is predominant. Is this a limitation that might impact model predictions?
  
  (Line 314) Is there a specific reason why the posterior elements were included in the segmentation process? Previous studies have often omitted these structures from their models. A brief justification for their inclusion in the present work would be helpful.
  
  (Lines 322-323) Are there any references or prior studies that support the selection of the specific reference tissues used for phantomless calibration?
  
  (Lines 349-356) While equations for modulus and yield stress are provided, a short explanation of how these equations compare to other published models and why they were chosen could be more clearly included.
  
  (Lines 361-373) The explanation of the simulation procedure, while valuable, does not clearly state whether it was performed solely on the L4 vertebra (described as the reference image) or applied individually to each vertebral body. Please clarify this point. Additionally, although the loading and boundary conditions are described, the manuscript lacks detail on how endplate irregularities or variations in vertebral alignment were addressed.
  
  (Line 387) For the failure load calculation using the stiffness-based method, which specific vertebrae were used to measure height? Please clarify whether height measurements were taken from all vertebrae in the cohort, only from those included in the force analysis, or from a subset.
  
  (Lines 397-399) The "graph model" approach for intervertebral strength normalization is not explained in detail. While it appears that this method corresponds to the analysis presented in Figure 6, this connection is not clearly stated in the text.
  
  (Lines 122-144) In the section Linear models approximate nonlinear vertebral strength estimates, it is unclear how the nonlinear model itself was validated. The manuscript does not reference any experimental or literature-based benchmarks to support the accuracy of the nonlinear failure load predictions. Please clarify whether any validation against in vitro or in vivo vertebral failure data was performed or cited. If such validation is lacking, this should be acknowledged as a limitation and discussed in terms of its potential impact on the interpretation of the results.
  
  Minor suggestions:
  
  Terminology: The term "phantomless calibration" is well-used, but a brief definition upfront (in Abstract or Background) would help readers unfamiliar with the concept.
  
  (Line 59) Does the word "transparent" refer to a clearer modeling workflow?
  
  (Lines 87-89) Consider relocation of the statement ("By providing these outputs, we offer a ready-to-use reference..."), which seems confusing and cuts the flow of the text.
  
  FIGURES: Ensure axis labels, units, and legends in all figures (especially Fig. 4 and Fig. 6) are visible and explained.
  
  FIGURE 3A - C. The subplot titles could lead to misinterpretation or confusion about what is being described.
Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.04.19.649449v1
www.biorxiv.org www.biorxiv.org

SPEX: A modular end-to-end platform for high-plex tissue spatial omics analysis

3
1. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Recent advancements in transcriptomics and proteomics have opened the possibility for spatially resolved molecular characterization of tissue architecture with the promise of enabling a deeper understanding of tissue biology in either homeostasis or disease. The wealth of data generated by these technologies has recently driven the development of a wide range of computational methods. These methods have the requirement of advanced coding fluency to be applied and integrated across the full spatial omics analysis process thus presenting a hurdle for widespread adoption by the biology research community. To address this, we introduce SPEX (Spatial Expression Explorer), a web-based analysis platform that employs modular analysis pipeline design, accessible through a user-friendly interface. SPEX’s infrastructure allows for streamlined access to open source image data management systems, analysis modules, and fully integrated data visualization solutions. Analysis modules include essential steps covering image processing, single-cell and spatial analysis. We demonstrate SPEX’s ability to facilitate the discovery of biological insights in spatially resolved omics datasets from healthy tissue to tumor samples.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf090), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 3: Hongyoon Choi
  
  The manuscript introduces SPEX, a web-based platform designed for spatial omics data analysis. The authors highlight its user-friendly UI, modular analysis pipelines, and integration with open-source image data management systems. The platform supports image processing such as cell/nucleus segmentation, clustering, and spatial analysis. GUI-based approaches as well as python script-based modules increase usability for the broader research community. While the goals of the platform are commendable, and the integration of multiple analysis modules is a valuable contribution, there are critical shortcomings in the manuscript that must be addressed before publication. Several key weaknesses significantly limit the scientific rigor and impact of this work.
  
  One of the critical omissions in this manuscript is the lack of rigorous benchmarking against established tools. Although the manuscript compares SPEX with other tools such as Squidpy, Giotto, and MCMICRO, there is no quantitative comparison to demonstrate its advantages over existing methodologies. In particular, a spatial analysis such as CLQ is introduced as a different approach within the spatial biology analytics framework, but how does it compare to existing co-occurrence analysis methods? Additionally, similar analyses have been conducted using other tools (e.g., Mah, C.K., et al., Genome Biol 25, 82 (2024)), including for 'subcellular' colocalization, which raises concerns about novelty. Moreover, as mentioned in relation to Bento, could CLQ also be applied to subcellular analysis?
  
  In this regard, for spatial co-occurrence and the other algorithms in SPEX, the authors should run identical datasets through both SPEX and existing tools to compare performance and biological insights. Without such a comparison, it is impossible to assess whether SPEX provides any meaningful improvement over existing platforms.
  
  The cell typing process is one of the most fundamental steps in spatial omics analysis. However, SPEX does not integrate a dedicated cell typing module, forcing users to use another tool or define cell types manually. The accuracy of all downstream analyses (clustering, spatial interaction, pathway analysis) depends on robust and reliable cell typing. It would be better to integrate with automated cell typing solutions to increase usability.
  
  The manuscript focuses almost exclusively on single-cell resolution data and high-dimensional imaging-based methods (e.g., IMC, MIBI, MERFISH). However, spot-based transcriptomics platforms such as Visium are widely used in the field. SPEX does not provide modules or methodology tailored to spot-based spatial analysis (such as deconvolution), to super-resolution, or to deriving cell-based analysis from spots (e.g. bin2cell in VisiumHD). Neighborhood analyses, spatially variable gene detection, and similar approaches are also specialized for whole-transcriptome, spot-based methods.
  
  The manuscript does not clarify whether users can modify or extend the pipeline with custom Python scripts. A fuller description of how 'power users' can customize this ecosystem with Python scripts would be helpful.
  
  The biological relevance of the SPEX platform remains unclear, as the case studies presented are not sufficiently rigorous. As mentioned above, quantitative comparisons with other tools could clarify in which respects SPEX is better than other published tools and ecosystems. Alternatively, a case study presenting meaningful biological findings and explanations obtained with this tool could be helpful. While the results demonstrate technical capabilities, the manuscript does not show how SPEX enables novel biological discoveries compared to existing tools.
2. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Reviewer 2: Qianqian Song
  
  The manuscript presents an advancement in spatial omics analysis but needs improvements in quantitative benchmarking and computational scalability assessment, among other areas. With these revisions, SPEX has the potential to become a widely adopted platform in the spatial omics community. My specific comments are below:
  
  1) While the manuscript provides a qualitative comparison of SPEX with other spatial omics tools (e.g., Squidpy, Giotto, Aquila), quantitative benchmarking is missing. The authors should include a performance benchmark comparing runtime efficiency, segmentation accuracy, and clustering resolution against existing tools. They should also report computational efficiency metrics (e.g., memory usage, execution time, scalability across datasets of varying sizes).
  
  2) The study presents compelling results, but there is no independent validation or interpretation of computational outputs using experimental methods.
  
  3) The manuscript does not discuss hardware requirements, processing speed, or computational limitations. The authors should provide an assessment of SPEX's performance in different computing environments (e.g., local workstations vs. cloud computing vs. high-performance clusters).
  
  4) The Colocation Quotient (CLQ) method is well described, but the manuscript does not provide statistical validation (e.g., p-values, confidence intervals) for detected spatial relationships.
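As an illustration of the kind of statistical validation requested in point 4, the sketch below attaches a permutation p-value to a colocation quotient. It assumes the global nearest-neighbour form of the statistic, CLQ(A→B) = (N_A→B / N_A) / (N'_B / (N - 1)), where N_A→B counts type-A cells whose nearest neighbour is of type B and N'_B is N_B (or N_B - 1 when A and B are the same type). This may differ from the variant implemented in SPEX. The function names, the brute-force neighbour search, and the toy data are hypothetical and serve only as an example.

```python
import numpy as np


def nearest_neighbours(coords):
    """Index of each cell's nearest neighbour (brute force, self excluded)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist2, np.inf)  # a cell is never its own neighbour
    return dist2.argmin(axis=1)


def clq(nn, labels, a, b):
    """Global colocation quotient of type a towards type b (assumed form)."""
    labels = np.asarray(labels)
    is_a = labels == a
    n_a = is_a.sum()
    n_b = (labels == b).sum() - (1 if a == b else 0)
    n_ab = (labels[nn[is_a]] == b).sum()  # a-cells whose nearest neighbour is b
    return (n_ab / n_a) / (n_b / (len(labels) - 1))


def clq_permutation_test(coords, labels, a, b, n_perm=999, seed=0):
    """One-sided p-value for CLQ enrichment under random label shuffling."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    nn = nearest_neighbours(coords)  # geometry is fixed; only labels move
    observed = clq(nn, labels, a, b)
    null = np.array(
        [clq(nn, rng.permutation(labels), a, b) for _ in range(n_perm)]
    )
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, p_value


# Toy data (hypothetical): 20 tight A-B pairs placed far apart from each other.
xs = np.arange(20) * 10.0
coords = np.concatenate([
    np.column_stack([xs, np.zeros(20)]),        # type A cells
    np.column_stack([xs + 0.1, np.zeros(20)]),  # type B cells
])
labels = np.array(["A"] * 20 + ["B"] * 20)

observed, p_value = clq_permutation_test(coords, labels, "A", "B")
print(f"CLQ(A->B) = {observed:.2f}, permutation p = {p_value:.3f}")
```

Shuffling labels while keeping the coordinates fixed preserves the tissue geometry and the cell-type proportions, so the null distribution reflects only the absence of spatial association. A confidence interval could be obtained in a similar way by bootstrapping cells, which is not shown here.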
3. GigaScience 30 Sep 2025
  
  in GigaScience
  
  Reviewer 1: Ka Yee Yeung
  
  Li et al. presented SPEX (Spatial Expression Explorer), a web-based, open-source, end-to-end analysis platform offering a modular design and a user-accessible interface. The authors demonstrated use cases in spatial transcriptomics (MERFISH lung cancer) and spatial proteomics datasets (tonsil, public multiplexed ion beam imaging data). SPEX includes the following analytical modules:
  
  1. Image processing modules, comprising a 4-step sequence (image pre-processing, single-cell segmentation, post-processing, feature selection). Image loading supports OMERO integration. The output is a cell-by-expression matrix in AnnData format.
  2. Clustering modules for both spatial transcriptomic and proteomic data.
  3. A spatial analysis module that implements the CLQ (Colocation Quotient) method.
  4. A spatial expression analysis module that includes differential expression and pathway analysis.
  
  SPEX supports visualization via Vitessce.
  
  The paper is well written and addresses a rising interest and critical need in the biomedical community. The reviewer would like to request clarification on how extensible the modules are. The authors mentioned a SPEX pipeline builder in which "modules are selected from a library and dragged into a visual pipeline map", and also mentioned support for "flexible plug-in analysis modules". Which packages are available from the library? Can users import their own code, scripts, or packages? How are new plug-ins created?
  
  The reviewer also wonders how users interact with the results. Can the user click on the resulting image and select regions of interest to zoom in?

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2022.08.22.504841v1
www.biorxiv.org www.biorxiv.org

Aedes mosquito distribution across urban and peri-urban areas of Kinshasa city, Democratic Republic of Congo

2
1. GigaScience 17 Sep 2025
  
  in GigaByte
  
  Editors Assessment:
  
  In the Democratic Republic of Congo (DRC), Aedes mosquitoes are principal vectors of the arboviruses that cause yellow fever, chikungunya and dengue in the human population. However, systematic surveillance data on these species remain limited, hindering entomological and modelling research and control strategies. This paper is one of a series of Data Release papers in GigaByte, supported by TDR and the WHO, describing datasets hosted in GBIF to tackle these gaps in data on vectors of human disease. To address this data deficiency, the paper presents a geo-referenced dataset of 6,577 entomological occurrence records collected in 2024 throughout urban and peri-urban areas of Kinshasa in the Democratic Republic of Congo. The data were collected using larval dipping, human landing catches, Prokopack aspirators, and BG-Sentinel traps. Data auditing and peer review found the data well validated, but requested some additional fields and methodological details. This work and the extremely useful data provided represent an important step towards building a pan-African resource for Aedes mosquito data collection.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 17 Sep 2025
  
  in GigaByte
  
  Abstract: In the Democratic Republic of Congo (DRC) Aedes mosquitoes are principal vectors of medically important arboviruses, with major implications for yellow fever, chikungunya and dengue. However, systematic surveillance of these species remains limited, constrained by competing public health priorities such as malaria and other neglected tropical diseases. This gap in surveillance prevents the rapid detection of changes in distribution, abundance and behaviour, particularly in rapidly urbanizing environments where breeding habitats are proliferating and ecological conditions are favourable for the establishment of these vectors. To address this gap, spatially explicit, small-scale data on Aedes populations in urban and peri-urban areas are needed to accurately assess transmission risk and develop targeted, evidence-based vector control strategies. Here, we present a geo-referenced dataset of 6,577 entomological occurrence records collected in 2024 throughout urban and peri-urban areas of Kinshasa city, DRC, using larval dipping, human landing catches, Prokopack aspirator, and BG-Sentinel traps. Records include Aedes albopictus (n = 2,694), Aedes aegypti (n = 1,939), Aedes vittatus (n = 2), and Aedes spp. (n = 1,942), each annotated with species, sex, life stage, reproductive status, and spatial coordinates. The dataset is published as a Darwin Core archive in the Global Biodiversity Information Facility (GBIF), and represents the most detailed, spatially explicit record of Aedes mosquito occurrence in Kinshasa to date, providing a robust foundation for entomological and modelling research to support data-driven arbovirus vector control strategies in DRC.
  
  Reviewer 1. Bastien Molcrette
  
  Are all data available and do they match the descriptions in the paper?
  
  Correction needed in manuscript Table 1: row ‘Ae. spp (*unid)’ column ‘total’ should be 1942 (instead of 1932). Additional Comments: Aedes vittatus has only been observed and characterized twice in a full year, among 6577 samples: how confident are you that these samples have been correctly classified? Are there any other references for the observation of Aedes vittatus around Kinshasa?
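The Table 1 correction above can be cross-checked against the per-species counts and the overall total quoted in the abstract. The snippet below is a minimal sketch of that arithmetic; the dictionary keys are labels chosen here for illustration and are not field names from the dataset.

```python
# Per-species record counts as quoted in the abstract.
counts = {
    "Aedes albopictus": 2694,
    "Aedes aegypti": 1939,
    "Aedes vittatus": 2,
    "Aedes spp. (unidentified)": 1942,
}
reported_total = 6577

total = sum(counts.values())
print(total, total == reported_total)

# With the value currently printed in Table 1 (1932), the rows no longer
# add up to the reported total.
table1_total = total - counts["Aedes spp. (unidentified)"] + 1932
print(table1_total, table1_total == reported_total)
```

The counts add up to 6,577 only when the unidentified Aedes spp. row is 1,942, which is consistent with the reviewer's correction.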
  
  The full data review and audit is here: https://gigabyte-review.rivervalleytechnologies.com/journal/gx/download-files?YXJ0aWNsZV9pZD02NDAmZmlsZT0yODAmdHlwZT1nZW5lcmljJnZpZXc9dHJ1ZQ==
  
  Reviewer 2. Paul Taconet
  
  Is the data acquisition clear, complete and methodologically sound?
  
  No. See attached.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction?
  
  No. See below.
  
  Additional Comments: This data paper presents a valuable contribution, and the effort invested in publishing such a dataset is both commendable and highly appreciated. It represents an important step towards building a pan-African resource for Aedes mosquito data collection.
  
  Overall, the paper and dataset are highly promising, but clarifying the sampling design and improving metadata consistency will significantly enhance their usability and scientific value.
  
  Major comments:
  
  The main point of confusion concerns the geographical definition of the sampling sites. In the manuscript, it is stated that “within each area, two sampling sites were selected.” This suggests a total of four sampling sites (2 areas × 2 sites each). However, elsewhere the text mentions “adults collected from different households for each of the three sampling techniques,” which implies three households per area (i.e., three sites).
  
  In contrast, the dataset appears to include only two sampling points (one per area), each with extremely precise geographic coordinates (six decimal places, implying sub-meter accuracy). This suggests that collections were made at identical locations, contradicting the description in the paper (two sites, multiple households, etc.).
  
  To resolve this inconsistency, clarification is needed both in the paper and in the dataset:
  
  In the manuscript, explicitly state the number of sampling sites used for each protocol.
  
  In the dataset, either provide the true coordinates or specify the level of spatial accuracy. This could be achieved by adding a column such as coordinatePrecision, coordinateUncertaintyInMeters, or footprintWKT in the event table (see: https://dwc.tdwg.org/list/2020-10-13#dwc_coordinatePrecision, https://dwc.tdwg.org/list/#dwc_coordinateUncertaintyInMeters, https://dwc.tdwg.org/list/#dwc_footprintWKT). Such clarification is essential.
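To make the "six decimal places implies sub-meter accuracy" point concrete, the sketch below converts a number of decimal places in decimal degrees into an approximate ground distance. It assumes a spherical Earth with the mean radius, and uses an approximate latitude for Kinshasa (about 4.3° S). The function name is hypothetical, and the uncertainty value in the final dictionary is a placeholder rather than a figure from the dataset.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation


def decimal_place_resolution_m(decimals, latitude_deg):
    """Ground distance represented by the last decimal place of a coordinate.

    Returns (north-south, east-west) distances in metres.
    """
    step_deg = 10.0 ** -decimals
    metres_per_degree = 2 * math.pi * EARTH_RADIUS_M / 360
    north_south = step_deg * metres_per_degree
    # Meridians converge towards the poles, so east-west steps shrink
    # with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# Approximate latitude of Kinshasa (assumption used for illustration).
ns, ew = decimal_place_resolution_m(6, latitude_deg=-4.3)
print(f"{ns:.3f} m north-south, {ew:.3f} m east-west")

# Darwin Core terms that could be added to each event record. The
# uncertainty value is a hypothetical placeholder, not taken from the dataset.
added_fields = {
    "coordinatePrecision": 0.000001,       # precision of the stored decimals
    "coordinateUncertaintyInMeters": 500,  # hypothetical radius around the point
}
```

Six decimal places correspond to roughly 0.11 m on the ground. If the two published points stand for several households, a realistic coordinateUncertaintyInMeters or a footprintWKT polygon would describe the sampled area more accurately than the raw precision of the coordinates.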
  
  Minor comments (manuscript):
  
  In the “Mosquito collection” section, please provide more detail about the sampling schedule (e.g., total number of sessions for each technique, average sampling frequency, etc.).
  
  In Table 2, define precisely how dry and rainy seasons were determined (e.g., based on calendar months, rainfall thresholds, or another criterion).
  
  The dataset contains information on mosquito sex and feeding status, yet the paper does not describe how these were determined. Please add methodological details.
  
  Indicate how far apart the sampled households were located, since simultaneous sampling at nearby sites could bias results.
  
  Typographical corrections:
  
  Introduction: “entomological occurrence records collected in 20224” → “2024”.
  
  Introduction: “spatially explicit record of Aedes mosquito occurrence in Kinshasa to data” → “to date”.
  
  Methods: “Water from each breeding sites was using with a ladle...” → revise wording for clarity.
  
  Comments on the dataset:
  
  For completeness, the event table could include additional fields such as habitat, samplingEffort (especially relevant for adult collection), sampleSizeValue, and sampleSizeUnit. These details are already provided in the paper and could easily be added to the GBIF dataset.
  
  In the occurrence table, the entries under ScientificName are currently generic (e.g., “Aedes albopictus” should be written as Aedes albopictus (Skuse, 1895)). Consider renaming the current column as genericName and adding a proper ScientificName column with complete taxonomic names.
  
  The use of MaterialSample as the basisOfRecord seems questionable. According to community discussions (e.g., https://discourse.gbif.org/t/understanding-basis-of-record/5857), HumanObservation would be more appropriate in this case.

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.09.03.674006v1
www.biorxiv.org www.biorxiv.org

Whole Genome Sequencing and Assembly of the House Sparrow, Passer domesticus

2
1. GigaScience 02 Sep 2025
  
  in GigaByte
  
  Editors Assessment:
  
  This paper presents the genome sequencing of the house sparrow (Passer domesticus), with genome assembly and annotation carried out using in silico approaches and tools, yielding what could be a valuable resource for understanding passerine evolution, biology, ethnology, geography, and demography. The final genome assembly was generated using short-read sequencing and a computational workflow that included Shovill, SPAdes, MaSuRCA, and BUSCO benchmarking, producing a 922 Mb reference genome with 24,152 genes. The first draft was significantly smaller than this, but peer review provided suggestions on how to improve the assembly quality, and after a few attempts an assembly with a reasonable size and BUSCO score was achieved. These openly available data could serve as a valuable resource for studying adaptation, divergence, and speciation in birds.
  
  This evaluation refers to version 2 of the preprint
  
  Summary
2. GigaScience 02 Sep 2025
  
  in GigaByte
  
  Abstract: The common house sparrow, Passer domesticus, is a small bird belonging to the family Passeridae. Here, we provide high-quality whole genome sequence data along with an assembly for the house sparrow. The final genome assembly was assembled using a Shovill/SPAdes/MaSuRCA/BUSCO workflow, consisting of contigs spanning 268193 bases and coalescing around a 922 Mb sized reference genome. We employed rigorous statistical thresholds to check the coverage, as the Passer genome showed considerable similarity to Gallus gallus (chicken) and Taeniopygia guttata (Zebra finch) genomes, also providing a functional annotation. This new annotated genome assembly will be a valuable resource as a reference for comparative and population genomic analyses of passerine, avian, and vertebrate evolution.
  
  Significance: Avian evolution has been of great interest in the context of extinction. Annotating genomes such as those of passerines would be of significant interest, as we could understand behavior and foraging traits and further explore their evolutionary landscape. In this work, we provide a full genome sequence of the Indian house sparrow, viz. Passer domesticus, which will serve as a useful resource in understanding adaptability, evolution, geography, Allee effects and circadian rhythms.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.161), and has published the reviews under the same license.
  
  Reviewer 1. Gang Wang
  
  Is the language of sufficient quality? Yes. Many details in the article need attention, such as citation format and spelling, e.g. [Supplementary Table 3a, 3b, 3c) → (Supplementary Table 3a, 3b, 3c). The citation format of the article also needs to be adjusted according to the journal requirements.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. A previous reviewer mentioned that RagTag could be used to improve the quality of genome assembly. I suggest you seriously consider this.
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data? No
  
  Overall Comments: The article is logically clear and the analysis is complete. The description of both sample collection and sequencing is relatively clear, and the analysis process shown in Figure 1 is very reasonable. However, as the previous reviewer noted, I suggest that you remove the "high-quality" claim. Many details in the article need attention, such as citation format and spelling, e.g. [Supplementary Table 3a, 3b, 3c) → (Supplementary Table 3a, 3b, 3c). The citation format of the article also needs to be adjusted according to the journal requirements. In Figure 2, the panel letters a and b differ too much in style; please make them consistent. Figure 4 is completely unclear; please increase the font size. A previous reviewer mentioned that RagTag could be used to improve the quality of the genome assembly. I suggest you seriously consider this. Re-review: The authors used FCS-GX to exclude contaminating sequences in the genome, so I agree that this paper should be published.
  
  Reviewer 2. Agustin Ariel Baricalla
  
  Are all data available and do they match the descriptions in the paper? No. Matching data: the NCBI project, with access to the raw data deposited in NCBI SRA. Non-matching data: Oxford Nanopore data. In their reply on a previously submitted version, the authors argued that these data were not used, but Fig. 1 refers to Nanopore MinION data. The manuscript body and the additional data section do not include the QUAST and BUSCO reports or their corresponding plots.
  
  Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide No. GigaByte suggests a checklist including the genome, CDS, and proteins in FASTA format, as well as the annotations in GFF format; however, these items are not available for evaluation.
  
  Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes. The FastP step for raw data processing is mentioned in the results section but is not detailed in the methods section.
  
  Is there sufficient data validation and statistical analyses of data quality? No. The authors have not included the BUSCO results. The OrthoDB database for 'passeriformes_odb12' contains over 10,000 curated genes, representing approximately 50-60% of the total genes in a typical passeriform genome. Therefore, the BUSCO report for the new assembly should be provided. The authors state that "The gene completeness for Passer was assessed through Benchmarking Universal Single-Copy Orthologs ( Busco version 5.5.0 ) [26] by using the orthologous genes in the Gallus gallus [ chicken] genome", but BUSCO runs on the OrthoDB datasets, so I do not understand what this phrase refers to.
  
  Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes. All the procedures are consistent and the programs or pipelines are well-known and well documented in the bioinformatic and genomic fields.
  
  Additional Comments: The inclusion of the mitochondrial genome represents a significant improvement in this manuscript. I recommend presenting all nuclear results together first, followed by a separate and clear description of the mitochondrial analysis and findings to enhance clarity. The data are of interest for analyzing the genetic dynamics behind Passer domesticus adaptation and evolution, and could reveal differences from the previously available genomes of European reference samples, but this comparison is not presented in this work. As of this revision, the NCBI's Passer domesticus genome includes two European reference genomes, both classified with 'chromosome-like' status (NCBI: GCF_036417665.1 and GCA_001700915.1). These genomes can be utilized in two distinct ways: (1) performing a 'genome-guided assembly' with MaSuRCA, using one of these genomes alongside the Illumina data, or (2) conducting genome scaffolding by employing one of these genomes as a reference and the assembled genome from raw reads as a query, using tools like RagTag or the chromosome scaffolder available in MaSuRCA. Both approaches could potentially lead to improvements in scaffold number and contiguity metrics, such as N50, N90, and the largest scaffold.
  
  Re-review: The authors have slightly improved the version previously presented, but it still does not meet the minimum standards established by the publisher for publication in the journal. Easily achievable changes that were requested to complement the earlier analysis have been ignored. Requests have not been answered, figures that are inconsistent with each other and with the text have not been fixed, and no relevant improvement between the previous and current versions has been shown.
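For readers less familiar with the contiguity metrics named above, the sketch below computes N50 and N90 from a list of scaffold lengths, using the usual definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least x% of the assembly. The function name and the example lengths are hypothetical and are not taken from the sparrow assembly.

```python
def nx(lengths, x=50):
    """Return the Nx value (e.g. N50 for x=50) for a list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    raise ValueError("lengths must contain at least one sequence")


# Hypothetical scaffold lengths (total = 300).
scaffolds = [80, 70, 50, 40, 30, 20, 10]

n50 = nx(scaffolds, 50)   # 80 + 70 = 150 reaches 50% of 300
n90 = nx(scaffolds, 90)   # 80 + 70 + 50 + 40 + 30 = 270 reaches 90% of 300
largest = max(scaffolds)
print(n50, n90, largest)
```

Reference-guided scaffolding joins contigs into fewer, longer scaffolds, which would typically raise these values. Comparing them before and after scaffolding is one way to report the improvement the reviewer asks for.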

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.11.04.565608v3
Aug 2025
www.biorxiv.org www.biorxiv.org

Telomere-to-telomere African wild rice (Oryza longistaminata) reference genome reveals segmental and structural variation

2
1. GigaScience 26 Aug 2025
  
  in GigaScience
  
  Abstract: Rice (Oryza sativa) is one of the most important staple food crops worldwide, and its wild relatives serve as an important gene pool in its breeding. Compared with cultivated rice species, African wild rice (Oryza longistaminata) has several advantageous traits, such as increased biomass production, clonal propagation via rhizomes, and resistance to biotic stresses. However, previous O. longistaminata genome assemblies have been hampered by gaps and incompleteness, restricting detailed investigations into their genomes. To streamline breeding endeavors and facilitate functional genomics studies, we generated a 343-Mb telomere-to-telomere (T2T) genome assembly for this species, covering all telomeres and centromeres across the 12 chromosomes. This newly assembled genome has markedly improved over previous versions. Comparative analysis revealed a high degree of synteny with previously published genomes. A large number of structural variations were identified between O. longistaminata and O. sativa. A total of 2,466 segmentally duplicated genes were identified and enriched in cellular amino acid metabolic processes. We detected a slight expansion of some subfamilies of resistance genes and transcription factors. This newly assembled T2T genome of O. longistaminata provides a valuable resource for the exploration and exploitation of beneficial alleles present in wild relative species of cultivated rice.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf074), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Chengzhi Liang
  
  The authors generated a 343-Mb telomere-to-telomere (T2T) genome assembly for an African wild rice (Oryza longistaminata), covering all telomeres and centromeres across the 12 chromosomes, and performed genome annotation and analyses on structural variations and NLR genes. While the manuscript has provided a valuable genome sequence, several problems should be addressed before the manuscript can be published.
  
  Major issues
  
  1. The authors estimated the genome heterozygosity at 1.27%, which is quite high. I therefore wonder how large the assembly is when only HiFi data are used; comparing that size with the final size of the 12 chromosomes would reflect the actual heterozygosity rate of the genome. If there was only one gap in the initial Hifiasm assembly (a total of 13 contigs), it is unlikely that the genome has such high heterozygosity. In Table 1, the total size of the assembled genome was 331,045,917 bp. If this is the summed size of the 12 chromosomes, it should be used as the final genome size in the main text. Please clarify. Also, what is the base accuracy of the ultra-long CycloneSEQ data? This would be useful to readers, since it is a new sequencing technology.
  2. For SV detection, considering that the assembled genome in the manuscript (does it have an accession ID or name?) is an African wild rice, it is rather strange that the authors compared it with an O. sativa genome rather than an O. glaberrima genome. The names of the genomes used should also be given, since each species has many different genomes, all with different SVs between them.
  3. The conclusion that "This distribution suggests that chromosomes 1, 4, 3, and 2 might have contributed to the evolution of rice in previously unrecognized ways (Table S8)" is purely speculative, and thus should be removed from the manuscript, or the authors should provide more evidence to support it.
  4. The authors claimed that "Compared with other Oryza species, O. longistaminata has many fewer NBS-lRR domain genes, which reflects a contraction of resistance genes in this species." Please give specific gene numbers for each species. The conclusion also does not look right, since O. longistaminata appears to have more NBS-LRR genes than other species.
  
  Minor issues
  
  1. What is "quartets"?
  2. The authors used "11 Oryza species", which included O. indica; please clarify what this species is.
2. GigaScience 26 Aug 2025
  
  in GigaScience
  
  Abstract: Rice (Oryza sativa) is one of the most important staple food crops worldwide, and its wild relatives serve as an important gene pool in its breeding. Compared with cultivated rice species, African wild rice (Oryza longistaminata) has several advantageous traits, such as increased biomass production, clonal propagation via rhizomes, and resistance to biotic stresses. However, previous O. longistaminata genome assemblies have been hampered by gaps and incompleteness, restricting detailed investigations into their genomes. To streamline breeding endeavors and facilitate functional genomics studies, we generated a 343-Mb telomere-to-telomere (T2T) genome assembly for this species, covering all telomeres and centromeres across the 12 chromosomes. This newly assembled genome is markedly improved over previous versions. Comparative analysis revealed a high degree of synteny with previously published genomes. A large number of structural variations were identified between O. longistaminata and O. sativa. A total of 2,466 segmentally duplicated genes were identified and enriched in cellular amino acid metabolic processes. We detected a slight expansion of some subfamilies of resistance genes and transcription factors. This newly assembled T2T genome of O. longistaminata provides a valuable resource for the exploration and exploitation of beneficial alleles present in wild relative species of cultivated rice.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf074), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Francois Sabot
  
  The manuscript from Guang et al. deals with a T2T assembly for the wild perennial African rice Oryza longistaminata. Using the most up-to-date technologies and approaches, the authors provide a high-quality assembly for this wild species, making it a valuable resource for understanding rice evolution. While the assembly results are of high quality, the interpretation of some biological results, in particular those concerning the NBS-LRR genes, is quite weird in my opinion and needs to be refined. That is why I think the manuscript should be published, but after major corrections.
  
  In detail:
  
  - Introduction: I am not sure that highlighting the exceptional biomass of O. longistaminata is a good idea, as this plant has a very high silicon content, which makes its biomass difficult to use.
  - Methods: Most of the command options and command lines are not available. Please provide them, at least as a text file in the supplementary data. In addition, some of the references for tools are missing. Finally, please provide the accession number of the assembled plant.
  - Assembly itself: O. longistaminata is an outcrossing, heterozygous organism. Did you obtain the two haplotypes?
  - Comparison with the previous longistaminata genome: is the inversion in the middle of Chr6 specific, or is it due to an error in the previous assembly?
  - Table 1: what do you mean by "Total size of assembled genomes (bp) 331,045,917"? What is the residual percentage of N?
  - Figure 1 and others: please present the legend in another way; as it stands, it may be confused with the main text. In addition, check the legends for spelling and the figure sizes (e.g. 3b) for legibility.
  - SyRI/MUMmer analysis: did you set the minimum size at 1 kb? What was the order of query vs. reference? Can we have a BED file with the positions?
  - SD: is there a statistical link between chromosome size and the number of SDs? It could explain why the first four chromosomes have more SDs. In general, the data are missing statistics.
  - GO in SD: is there any statistical validation?
  - Genome comparison: please provide the accession numbers of the genomes you used for comparison.
  - NBS-LRR: the longistaminata genome has 215 genes, compared with 116 to 289 for other Oryza species, so I cannot see any contraction or expansion. In addition, the text here is weird: it starts by speaking of contraction and then moves on to expansion.
  - TF analysis: the African assemblies are quite bad, I think, which explains the discrepancy. For glaberrima, did you check the one from Tranchant-Dubreuil et al., 2023?

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.09.05.611405v1
www.biorxiv.org

A telomere to telomere phased genome assembly and annotation for the Australian central bearded dragon Pogona vitticeps

2
1. GigaScience 26 Aug 2025
  
  in GigaScience
  
  Abstract
  Background: The central bearded dragon (Pogona vitticeps) is widely distributed in central eastern Australia and adapts readily to captivity. Among other attributes, it is distinctive because it undergoes sex reversal from ZZ genotypic males to phenotypic females at high incubation temperatures. Here, we report an annotated telomere to telomere phased assembly of the genome of a female ZW central bearded dragon.
  Results: Genome assembly length is 1.75 Gbp with a scaffold N50 of 266.2 Mbp, N90 of 28.1 Mbp, 26 gaps and 42.2% GC content. Most (99.6%) of the reference assembly is scaffolded into 6 macrochromosomes and 10 microchromosomes, including the Z and W microchromosomes, corresponding to the karyotype. The genome assembly exceeds the standard recommended by the Earth Biogenome Project (6CQ40): 0.003% collapsed sequence, 0.03% false expansions, 99.8% k-mer completeness, 97.9% complete single copy BUSCO genes and an average of 93.5% of transcriptome data mappable back to the genome assembly. The mitochondrial genome (16,731 bp) and the model rDNA repeat unit (length 9.5 Kbp) were assembled. Male vertebrate sex genes Amh and Amhr2 were discovered as copies in the small non-recombining region of the Z chromosome, absent from the W chromosome. This, coupled with the prior discovery of differential Z and W transcriptional isoform composition arising from pseudoautosomal sex gene Nr5a1, suggests that complex interactions between these genes, their autosomal copies and their resultant transcription factors and intermediaries determine sex in the bearded dragon.
  Conclusion: This high-quality assembly will serve as a resource to enable and accelerate research into the unusual reproductive attributes of this species and for comparative studies across the Agamidae and reptiles more generally.
  Species Taxonomy: Eukaryota; Animalia; Chordata; Reptilia; Squamata; Iguania; Agamidae; Amphibolurinae; Pogona; Pogona vitticeps (Ahl, 1926) (NCBI:txid103695).
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf085), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Yuan Li
  
  The authors de novo assembled a telomere to telomere phased genome assembly of the Australian central bearded dragon Pogona vitticeps, using PacBio HiFi, ONT, HiC, and Illumina sequencing platforms. The assembly achieves remarkable contiguity (scaffold N50: 266.2 Mb) and completeness (97.9% BUSCO score), surpassing Earth Biogenome Project standards. The phased assembly of sex chromosomes (Z/W) and identification of candidate sex-determining genes (Amh, Amhr2, and Nr5a1) provide valuable insights into reptilian sex determination. Overall, the study is well-executed and provides a valuable resource for comparative genomics and reproductive biology.
  
  Major concerns:
  1. The description of read depth contains errors at lines 401-402, such as "60.6x". In addition, "4 x promethION" and "2x150 bp" should be revised; please check and revise all similar descriptions in the manuscript.
  2. There are errors in the citation format of the journal references, such as missing full stops between the article title and the journal name at lines 1005-1009, and a mix of abbreviated and full journal names (e.g., "PNAS" vs. "Proceedings of the National Academy of Sciences USA") (lines 988-990, 1005-1009). Please check the format of all references carefully.
  3. The scripts "calculateGC.py" and "processtrftelo.py" (lines 242 and 245) are mentioned without code availability or parameter details. Please provide working links or repository access.
  4. "Gb" and "Gbp" are used inconsistently; a single unified form is recommended.
  5. Units are missing in several places in Tables 1 and 2, such as for "Total Bases" and "Assembly length"; please include them.
  6. At lines 683-687, the conclusion that Amh/Amhr2 are sex-determining genes relies solely on positional evidence. Please discuss the need for functional studies (e.g., CRISPR knockouts) to strengthen these claims.
  7. There are errors in "Vasimuddin et al. 2019" (line 238) and "Danecek et al. 2021" (line 239). Please check the format of all other references.
  8. At lines 476-481, BAC mappings are cited as validation but lack visual evidence (e.g., alignment plots in figures or supplements). Please verify the accuracy of Figure 7 at line 478, as it does not correspond with the description.
2. GigaScience 26 Aug 2025
  
  in GigaScience
  
  
  Reviewer 1: Heiner Kuhl
  
  Patel et al. present a genome assembly of the bearded dragon Pogona vitticeps, a lizard species that is widely kept as a pet and known for its interesting sex determination, which may switch from genetic sex determination (ZW) to temperature-dependent sex reversal. The methods chosen to assemble the genome are state of the art, including HiFi and ONT long reads, Hi-C and suitable bioinformatic tools.
  
  I have to admit that I have recently been reviewing a similar manuscript for GigaScience (https://www.biorxiv.org/content/10.1101/2024.09.05.611321v1), in which a male ZZ P. vitticeps was sequenced and assembled from long-read data of a different nanopore technology, and the Z and W chromosomes were analysed by short-read coverage analysis. One of my major comments was that this approach lacked a true assembly of the W chromosome. Thus, I am happy to see that the assembly of the W-specific region has been achieved here, and the sequencing technologies used might even improve the assembly quality over the ZZ assembly in terms of phasing, consensus accuracy, etc. The two manuscripts are highly complementary and I think they should be published, if possible, in the very same issue of GigaScience. Surely both groups have invested a lot of effort. (Reading L. 685, I have just realized that this seems to be the intention of the journal, and I very much support this idea.)
  
  Still there are some minor points that need improvement for the current manuscript:
  
  Why do you leave the Z and W split into PAR, Z-specific and W-specific scaffolds rather than assembling the full-length chromosomes (L. 676)? Would the Hi-C data not support that?
  
  Mitochondrial assembly: this is from ONT only (L. 307). Please do a consensus correction with Illumina data, or at least show that the MT assembly has a high consensus accuracy (Q40-Q50).
  
  Genome annotation: show BUSCO scores for the annotated proteins (do they match the BUSCO results for the whole genome?). If possible, compare to the results of the NCBI RefSeq annotation (is it already available?). In this regard, please explain the relatively low mapping rates (L. 647) of RNA-seq to the annotated sequences.
  
  Could you provide some expression data for the Z-specific Amh and Amhr2? Are they differentially expressed in testis/ovary (after correction for copy number)?
  
  Table 1: could you show results for the two different ONT library types (ligation vs. ultralong kit)? It seems the overall yield was low (5 cells -> 100 Gb); any speculation as to why?
  
  I think the assembly statistics (Table 2) should also include contig N50 length as an additional value to show the high contiguity of the assembly.
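  For readers unfamiliar with the statistic, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch of the calculation, assuming a plain list of contig lengths (the function name `nx` and the toy lengths are illustrative and are not taken from the manuscript):

```python
def nx(lengths, x=50):
    """Return the Nx statistic (N50 by default) for a list of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)   # longest sequences first
    threshold = sum(ordered) * x / 100        # bases that must be covered
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length                     # shortest sequence needed to reach x%
    return ordered[-1]                        # not reached for x <= 100; kept as a guard


# Toy contig lengths (arbitrary units), for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(nx(toy_lengths))        # 70
print(nx(toy_lengths, x=90))  # 30
```

  The same function applies to contigs and scaffolds alike; only the input lengths differ.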
  
  L. 488: "48.36 (1 error in 146kb)". I think something is wrong here: Q48.36 would be 1 error in 68.5 kb. I would suggest re-checking these values and incorporating them in Table 2. The high consensus accuracy is one selling point compared to the competitor's assembly.
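  The arithmetic behind this comment follows from the Phred scale, Q = -10 log10(p), where p is the per-base error probability. A minimal sketch of the conversion in both directions (the function names are chosen here purely for illustration):

```python
import math


def phred_to_bases_per_error(q):
    """Expected number of bases per error for a Phred-scaled quality value."""
    return 10 ** (q / 10)


def bases_per_error_to_phred(bases_per_error):
    """Phred-scaled quality value for one error in the given number of bases."""
    return 10 * math.log10(bases_per_error)


# Values quoted in the review comment above.
print(round(phred_to_bases_per_error(48.36)))        # roughly 68,500 bases per error
print(round(bases_per_error_to_phred(146_000), 2))   # roughly Q51.6
```

  Under this definition, Q48.36 corresponds to about one error per 68.5 kb, as the reviewer states, whereas one error per 146 kb would correspond to roughly Q51.6.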
  
  L. 490: "Individual haplotypes were 85.5% complete…". Explain why you are confident that the haplotypes are more complete than the Merqury results suggest (just one sentence).

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.05.01.651798v1
www.biorxiv.org

A near-complete genome assembly of the bearded dragon Pogona vitticeps provides insights into the origin of Pogona sex chromosomes

2
1. GigaScience 26 Aug 2025
  
  in GigaScience
  
  Abstract
  Background: The agamid dragon lizard Pogona vitticeps is one of the most popular domesticated reptiles kept as pets worldwide. Its capacity to breed in captivity is also making it an emerging model species for a range of scientific research, especially for studies of sex chromosome origin and sex determination mechanisms.
  Results: By leveraging the CycloneSEQ and DNBSEQ sequencing technologies, we conducted whole genome and long-range sequencing for a captive-bred ZZ male to construct a chromosome-scale reference genome for P. vitticeps. The new reference genome is ∼1.8 Gb in length, with a contig N50 of 202.5 Mb and all contigs anchored onto 16 chromosomes. Genome annotation assisted by long-read RNA sequencing greatly expanded the P. vitticeps lncRNA catalog. With the chromosome-scale genome, we were able to characterize the whole Z sex chromosome for the first time. We found that over 80% of the Z chromosome remains as pseudo-autosomal region (PAR) where recombination is not suppressed. The sexually differentiated region (SDR) is small and occupied mostly by transposons, yet it aggregates genes involved in male development, such as AMH, AMHR2 and BMPR1A. Finally, by tracking the evolutionary origin and developmental expression of the SDR genes, we proposed a model for the origin of P. vitticeps sex chromosomes which considered the Z-linked AMH as the master sex-determining gene.
  Conclusions: Our study provides novel insights into the sex chromosome origin and sex determination of this model lizard. The near-complete P. vitticeps reference genome will also benefit future study of amniote evolution and may facilitate genome-assisted breeding.
  Competing Interest Statement: The authors have declared no competing interest.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf079), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1: Heiner Kuhl
  
  Guo et al. present a new reference genome for Pogona vitticeps, a widespread reptile model organism that is also common as a domestic animal worldwide. The genome assembly shows much improvement over an older assembly from 2017. Two points make this manuscript stand out from typical genome assembly papers:
  
  The authors find a new sex determination locus in this species.
  
  The authors use a new nanopore sequencing technology ("CycloneSEQ"), which has so far only been described in a preprint (https://www.biorxiv.org/content/10.1101/2024.08.19.608720v1).
  
  In my opinion this deserves publication in GigaScience, but both points must receive more focus in a revised manuscript.
  
  Major comments:
  
  1) The authors have sequenced a male individual (ZZ), which means the long-read reference assembly is missing the W chromosome. PAR and SDR regions are deduced from the Z sequence by analysis of sequencing coverage of only a few sexed samples (2 females and 4 males). It is unclear whether these individuals are from the same family, which could mean that the newly found SD region is just a family-specific variation. To make the whole story more intriguing and statistically sound, the authors should test at least 15 males and 15 females from different P. vitticeps populations for W-specific markers near the proposed AMH deletion. The authors should also show that the previously proposed SD locus (nr5a1) does not carry W-specific mutations in these 15+15 individuals. Furthermore, a phased assembly of a female (ZW) Pogona vitticeps individual could enable the assembly of the missing W chromosome and should be included; it would even improve analysis of W-specific sequences in the proposed additional individuals.
  
  2) A technology-aware reader would like to see more information on the specifics of CycloneSEQ data quality and handling, and perhaps a comparison to competing technologies. Which enzymes and buffers were used to prepare the library? The methods sections give only superficial descriptions (DNA repair buffer/enzyme, DNA clean beads, wash buffer for long fragments). Is it a kit, or were the enzymes and buffers purchased individually? I cannot find the procedure for preparation and sequencing of the long-read cDNA libraries. How many flowcells were needed to generate the different datasets? What do the read-length distributions look like (statistics over all reads, not only the selected 40 kb+ reads)? How variable were the runs, especially in cumulative output over time? What hardware was needed to run the basecalling, and what was the runtime? What is the Q-value distribution of the reads? Why is the consensus accuracy of the assembly low (Q36.4)? Can it be improved? Typically, reference-quality genomes should have Q40+. Which regions of the genome display lower consensus accuracies (is it random or sequence specific)?
  
  Minor comments:
  
  L. 900: PRJNAxxxxxx looks like a placeholder; please insert the true number.
2. GigaScience 26 Aug 2025
  
  in GigaScience
  
  
  Reviewer 2: Nazila Koochekian
  
  Impressive work, but it needs major revision to be accepted. The authors compressed everything into the results section and did not put enough effort into the other sections. The introduction and discussion need major changes and more details on many aspects of the study that appear in the results. The methods need rearrangement: it is common to order them as DNA extraction first, then sequencing, and so on. The data availability statement needs to be completed. Biosamples for each sequenced tissue, all the reads, and even intermediate assemblies need to be submitted to the database and reported in the manuscript. More specific comments are on the copy of the manuscript attached for the authors.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.09.05.611321v1
www.biorxiv.org

SeuratExtend: Streamlining Single-Cell RNA-Seq Analysis Through an Integrated and Intuitive Framework

2
1. GigaScience 04 Aug 2025
  
  in GigaScience
  
  ABSTRACT
  Single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity, but the rapid expansion of analytical tools has proven to be both a blessing and a curse, presenting researchers with significant challenges. Here, we present SeuratExtend, a comprehensive R package built upon the widely adopted Seurat framework, which streamlines scRNA-seq data analysis by integrating essential tools and databases. SeuratExtend offers a user-friendly and intuitive interface for performing a wide range of analyses, including functional enrichment, trajectory inference, gene regulatory network reconstruction, and denoising. The package seamlessly integrates multiple databases, such as Gene Ontology and Reactome, and incorporates popular Python tools like scVelo, Palantir, and SCENIC through a unified R interface. SeuratExtend enhances data visualization with optimized plotting functions and carefully curated color schemes, ensuring both aesthetic appeal and scientific rigor. We demonstrate SeuratExtend’s performance through case studies investigating tumor-associated high-endothelial venules and autoinflammatory diseases, and showcase its novel applications in pathway-level analysis and cluster annotation. SeuratExtend empowers researchers to harness the full potential of scRNA-seq data, making complex analyses accessible to a wider audience. The package, along with comprehensive documentation and tutorials, is freely available at GitHub, providing a valuable resource for the single-cell genomics community.
  
  Practitioner Points
  * SeuratExtend streamlines scRNA-seq workflows by integrating R and Python tools, multiple databases (e.g., GO, Reactome), and comprehensive functional analysis capabilities within the Seurat framework, enabling efficient, multi-faceted analysis in a single environment.
  * Advanced visualization features, including optimized plotting functions and professional color schemes, enhance the clarity and impact of scRNA-seq data presentation.
  * A novel clustering approach using pathway enrichment score-cell matrices offers new insights into cellular heterogeneity and functional characteristics, complementing traditional gene expression-based analyses.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf076), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Daniel A. Skelly
  
  Overall, this is a very nice writeup of a useful package that extends Seurat to expand the possibilities for single-cell analysts working in R. I liked the visualization options, the ability to try certain Python-based tools in R, which was not previously easy, and some of the authors' innovations, such as their broad use of pathway enrichment scores. Kudos to the authors for releasing a package with really excellent documentation and tutorials!
  
  I think this paper could be made better if the authors stressed with a little more clarity how specifically their work is innovative. The text in the present manuscript is fine but reads like a bit of a grab bag of functionality. For example, from the abstract: "SeuratExtend offers a user-friendly and intuitive interface for performing a wide range of analyses, including functional enrichment, trajectory inference, gene regulatory network reconstruction, and denoising. The package integrates multiple databases, … and incorporates popular Python tools … [We] showcase its novel applications in pathway-level analysis and cluster annotation. SeuratExtend enhances data visualization …"
  
  How could they be more clear or specific? One example could be by categorizing what SeuratExtend can do that other packages can't. For example, I see innovations in perhaps three general areas:
  1. Making single cell analyses easier/faster/prettier (i.e. visualizations, pathway enrichment)
  2. Making previously published single cell tools more broadly accessible (e.g. first option to bring certain Python tools to R)
  3. New innovations (e.g. dimensionality reduction and clustering based on pathway enrichment scores; may not be completely new but I don't recall seeing this elsewhere)
  If this were added, I feel the paper would more clearly communicate to readers the information they need to decide whether they want to try the package.
  
  I have the following additional significant comments:
  * Integration of multiple databases for GSEA: these methods are good, but what about in a few years when those databases have been updated? Do the authors intend to continue updating? Could they provide a function for users to use their own database (e.g. .gaf and .obo files, for example for another model organism)? A similar comment applies to gene identifier conversion, which may need to be updated every few years.
  * "While the Python ecosystem has benefited greatly from the comprehensive scverse project [7], which utilizes the universal AnnData format to connect various tools and algorithms, a comparable integrated solution has been lacking in the R community. SeuratExtend addresses this gap by providing a unified framework centered around the Seurat object, effectively becoming the R counterpart to scverse." Some might argue that SeuratWrappers is this solution. The authors should comment more clearly and explicitly on what SeuratExtend does differently from, or better than, SeuratWrappers.
  * I'm not particularly convinced by the authors' example studies that used SeuratExtend. For example, they describe Hua-Vella et al. (2022) and Hua et al. (2023). These are very nice studies and I have no doubt they made use of SeuratExtend in their analyses. But I don't see anything in the described analyses that was uniquely possible with SeuratExtend. Perhaps SeuratExtend made their analyses easier, or faster. But it would be better if we had some further concrete details. For example, something communicating a message like one of the following: (1) the authors only tested method X on a whim because it was so easy to run in SeuratExtend, and found that it revealed unexpected biology Y; or (2) the authors were able to bring together method X, which runs in R, and method Y, which runs in Python, and the joint inference (not possible in other packages) revealed key result Z. If the authors of this manuscript can't point to those sorts of examples, then I'm not sure it adds much to include this discussion in the present paper.
  * I really liked the section "Novel Applications of SeuratExtend in Pathway-Level Analysis and Cluster Annotation", especially "Exploring and Analyzing Single-Cell Data at the Pathway Level". I thought these applications could perhaps be stressed a bit more strongly or made more prominent earlier in the paper.
  * Figures 2 and 3 show example plots from which we don't actually need to infer any important biology. I thought these figures could be combined, with each individual plot type shown only once. (This is for clarity; I don't see anything incorrect about the authors' current plots.)
  * There may be some issues with dependencies for some users. For example, it prompted me to install viridis and loomR as I went through the Quickstart. I ended up encountering the error "there is no package called 'loomR'" while trying. I had to install it manually with remotes::install_github(repo = "mojaveazure/loomR"). Maybe provide an explicit dependency list or a list of recommended packages to install?
  * I had an error the first time I called Palantir.RunDM(). I hadn't created a seuratextend environment. I found that I could do this manually using create_condaenv_seuratextend(), but that this wasn't supported for Apple Silicon chips. I would suggest that the authors try to find a way to get this working on newer Apple chips, because Mac machines are very common among bioinformaticians in my experience.
  * While the writing is largely quite clear, I found it to be a bit voluminous. If the authors are able to cut down on text length, that may help in emphasizing the key points that make their package valuable to users.
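  As an illustration of the first comment, user-supplied annotation files could be turned into gene sets along the following lines. The sketch is in Python rather than R and is not part of SeuratExtend; it only shows how GAF annotation lines might become GO-term gene sets that an enrichment routine could consume. It assumes the standard tab-delimited GAF layout (comment lines start with "!", column 3 is the gene symbol, column 4 the qualifier, column 5 the GO ID). The function name and the example records are hypothetical, and the records are truncated to five columns for brevity.

```python
from collections import defaultdict


def gene_sets_from_gaf(lines):
    """Group gene symbols by GO ID from an iterable of GAF-formatted lines."""
    gene_sets = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue                          # skip blank and comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue                          # skip malformed records
        symbol, qualifier, go_id = fields[2], fields[3], fields[4]
        if "NOT" in qualifier.split("|"):
            continue                          # negated annotations are not members
        gene_sets[go_id].add(symbol)
    return dict(gene_sets)


# Hypothetical records; real GAF lines carry 17 columns.
example = [
    "!gaf-version: 2.2",
    "DB\tID1\tGeneA\tinvolved_in\tGO:0000001",
    "DB\tID2\tGeneB\tinvolved_in\tGO:0000001",
    "DB\tID3\tGeneC\tNOT|involved_in\tGO:0000001",
]
print(gene_sets_from_gaf(example))
```

  Passing an open file handle instead of a list works the same way, since the function only iterates over lines.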
  
  I had these minor comments:
  * "Moreover, mainstream scRNA-seq analysis tools are primarily developed for either the R or Python platforms, with additional options like Nextflow and Snakemake": I suggest revising this sentence. The tools are developed in the R or Python languages, which I would not call platforms. I would reword to say that Nextflow and Snakemake are workflow management systems that provide additional options for pipeline automation.
  * "the R ecosystem surrounding Seurat appears relatively limited": I'm not sure I would agree with this. I counted wrappers for 17 methods currently. Yes, it is true that there are more packages in scverse. However, I suggest moderating your claims about Seurat being limited.
  * I suggest removing Snakemake from Table 1, as it is really different from the other tools listed there.
2. GigaScience 04 Aug 2025
  
  in GigaScience
  
  
  Reviewer name: Yu H. Sun
  
  This manuscript introduces an extended version of the widely used Seurat package, named SeuratExtend. Specifically, Hua et al. developed an integrated and intuitive framework to streamline scRNA-seq data analysis, such as trajectory analysis, GRN construction, and functional enrichment analysis. The package also features direct integration with other popular tools, including Seurat, scVelo, etc. Notably, the software has been demonstrated through training programs and has over 100 stars on GitHub, which is impressive. I have tested the package, including installation and some basic functions. Moreover, the GitHub webpage is well-documented, featuring multiple use cases tailored for beginners. The overall user experience exceeded my expectations, though I have a few minor comments for improvement:
  
  1, The DimPlot2 function is very useful and makes it easy to customize the colors. However, the default color scheme seems to be too dark. A more distinguishable and visually appealing color palette might be a solution.
  
  2, How can the angles of cell type labels be controlled when using VlnPlot2? The 'Split visualization' example has all the labels in a horizontal direction, leading to overlapping in some cases, while the 'Subset Analysis' plots have labels at 45 degrees, which is much easier to read. However, I didn't see a parameter to control this. Does VlnPlot2 handle this automatically?
  
  3, It's a very nice feature to have the 'Statistical Analysis' function to label significant groups. However, in single cell analysis, p values are easily inflated by the large number of cells. While the example pbmc data is relatively small, larger datasets might yield significant p values without obvious differences in the violin plots. It would be beneficial to mention this in the documentation, and provide some guidance so the results won't be misleading.
  
  4, The ClusterDistrBar is another valuable function. Based on my experience with similar analyses, I suggest incorporating features to identify robust changes in cell type composition. For instance, tools like sccomp can help determine changes in cell population composition.
  
  5, I wonder if the gene label directions can be changed easily for WaterfallPlot?
  
  6, Regarding the volcano plot, does LogFC mean log2 or log(e)? I noticed that this may not be consistent across tools. For example, some tools like Seurat FindMarkers use log2, while NEBULA uses log(e). Clear labeling on the x-axis and tutorial guidance would help ensure consistency.
  
  7, Very nice introduction about the color palettes at the end of the Enhanced Visualization tutorial.
  
  8, The incorporation of Python tools into R, including scVelo and Palantir, is innovative. There may be a need to continue incorporating new tools, such as Dynamo, a newer tool I started to use recently. While this is not required for the current revision, it could be a valuable direction for future development.
  
  Overall, this tool represents a comprehensive extension of Seurat, combining enhanced visualization, pathway enrichment, and trajectory analysis into a single package. I look forward to seeing a revised version of this manuscript.
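  
  The unit question raised in comment 6 comes down to a constant factor: a natural-log fold change divided by ln(2) gives the log2 fold change. The following is a minimal sketch of that conversion in Python; the function names are hypothetical and are not part of SeuratExtend, Seurat, or NEBULA.

```python
import math


def ln_to_log2(ln_fc):
    """Convert a natural-log fold change to a log2 fold change."""
    return ln_fc / math.log(2)


def log2_to_ln(log2_fc):
    """Convert a log2 fold change to a natural-log fold change."""
    return log2_fc * math.log(2)


# A two-fold increase equals ln(2) on the natural-log scale
# and 1 on the log2 scale.
two_fold_as_log2 = ln_to_log2(math.log(2))
```

  Applying one such conversion before plotting, and stating the base on the axis label, would keep results from different tools on a common scale.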

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.08.01.606144v1
www.biorxiv.org www.biorxiv.org

Multiomics uncovers the epigenomic and transcriptomic response to viral and bacterial stimulation in turbot

3
1. GigaScience 04 Aug 2025
  
  in GigaScience
  
  Abstract: Uncovering the epigenomic regulation of immune responses is essential for a comprehensive understanding of host defence mechanisms, though remains poorly investigated in farmed fish. We report the first annotation of the innate immune regulatory response in the turbot genome (Scophthalmus maximus), integrating RNA-Seq with ATAC-Seq and ChIP-Seq (H3K4me3, H3K27ac and H3K27me3) data from head kidney (in vivo) and primary leukocyte cultures (in vitro) 24 hours post-stimulation with viral (poly I:C) and bacterial (inactive Vibrio anguillarum) mimics. Among the 8,797 differentially expressed genes (DEGs), we observed enrichment of transcriptional activation pathways in response to Vibrio and immune pathways - including interferon stimulated genes - for poly I:C. We identified notable differences in chromatin accessibility (20,617 in vitro, 59,892 in vivo) and H3K4me3-bound regions (11,454 in vitro, 10,275 in vivo) between stimulations and controls. Overlap of DEGs with promoters showing differential accessibility or histone mark binding revealed significant coupling of the transcriptome and chromatin state. DEGs with activation marks in their promoters were enriched for similar functions to the global DEG set, but not always, suggesting key regulatory genes being in poised state. Active promoters and putative enhancers were enriched in specific transcription factor binding motifs, many common to viral and bacterial responses. Finally, an in-depth analysis of immune response changes in chromatin state surrounding key DEGs encoding transcription factors was performed. This multi-omics investigation provides an improved understanding of the epigenomic basis for the turbot immune responses and provides novel functional genomic information, leverageable for disease resistance selective breeding.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf077), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Aijun Ma
  
  In the manuscript "Multiomics uncovers the epigenomic and transcriptomic response to viral and bacterial stimulation in turbot", many investigations were applied to uncover the immune regulatory response in the turbot. This multi-omics investigation provided an improved understanding of the epigenomic basis of turbot immune response and offers novel functional genomic information. However, some aspects need to be considered in order to improve the manuscript, as indicated below.
  
  1. Line 16: In this sentence, the authors used "the innate immune regulatory response" to describe the response to these two stimuli in a tissue and in cells. Innate immunity is a very strict term, and it is not appropriate to use it here.
  
  2. Line 34-36: Poly I:C and inactive Vibrio anguillarum are just like PAMPs, so the response to these two stimulations cannot represent the process of disease defense. The sentence "which can be leveraged for disease resistance selective breeding" was listed in the conclusions, which is not accurate. I suggest moving this sentence to the outlook section.
  
  3. Line 80-87: Head kidney is a key lymphoid organ in most marine fishes, and plays a central role in fish immunity. It is inappropriate to only talk about its innate immune function. Vibrio is a common bacterium in seawater, while Vibrio anguillarum is an opportunistic pathogen. Strictly speaking, experimental fish will inevitably meet Vibrio during the breeding process before the experiment. I suggest reorganizing the sentences of this paragraph.
2. GigaScience 04 Aug 2025
  
  in GigaScience
  
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf077), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Elisabeth Busch-Nentwich
  
  This is a careful analysis of a large and high-quality dataset that will be a very useful resource for researchers across disciplines. I commend the authors on their extensive metadata, and comprehensive and well annotated data tables, which make this a truly accessible resource. I don't have any major criticism. A few minor points: 1. Typo in Figure 1 (it's immature, not inmature) 2. In Fig 3 Upset plots could be a bit easier to parse 3. Fig 5 doesn't have a legend for the blue gradient (but it's pretty self-explanatory)
3. GigaScience 04 Aug 2025
  
  in GigaScience
  
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf077), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer name: Laura Caquelin
  
  Summary of the Study
  
  This study provides the first multi-omics investigation of the innate immune response in turbot (Scophthalmus maximus). By integrating RNA-Seq, ATAC-Seq, and ChIP-Seq data, researchers identified changes in gene expression, chromatin accessibility, and histone modifications after viral and bacterial stimulation. The findings reveal a significant coupling between the transcriptome and chromatin state, offering insights for the selection of disease resistance in aquaculture.
  
  Scope of reproducibility
  
  According to our assessment the primary objective is: Association of ATAC-Seq and ChIP-Seq data with RNA-Seq data
  
  ● Outcome: Overlap of promoter DARs and DHMRs with DEG promoters
  ● Analysis method outcome: Hypergeometric test
  ● Main result: "DARs and DHMRs were much more overrepresented at the promoter regions of upregulated rather than downregulated DEGs" (Table 4, Supplementary Table 11; Lines 403-405, Page 9)
  
  Availability of Materials
  
  a. Data
  ● Data availability: Raw data are available, but generated data from the study are shared with the journal and not yet publicly available
  ● Data completeness: Complete
  ● Access Method: Manuscript's supplementary files/Private journal dropbox
  ● Repository: -
  ● Data quality: Structured, but lacks variable definitions in supplementary files, making it difficult to interpret and use.
  
  b. Code
  ● Code availability: Not available for the primary result
  ● Programming Language(s): Excel
  ● Repository link: -
  ● License: -
  ● Repository status: -
  ● Documentation: README lacks information on hypergeometric test.
  
  Computational environment of reproduction analysis
  
  ● Operating system for reproduction: MacOS 14.7.4
  ● Programming Language(s): Excel
  ● Code implementation approach: Excel formulas based on methodology description provided by authors
  ● Version environment for reproduction: Excel version 16.94
  
  Results
  
  5.1 Original study results
  ● Results 1: Table 4 and supplementary table 11
  
  5.3 Steps for reproduction
  
  Reproduce supplementary table 11 to perform hypergeometric test
  
  * Issue 1: No code or instructions for constructing Table 4 in manuscript and README text.
    ▪ Resolved: Authors shared methodology upon request.
    Authors' Clarification: "The hypergeometric test wasn't carried out with any particular script but with the following public online tool, that can be replicated in Excel: https://systems.crump.ucla.edu/hypergeometric/
    The tool basically runs the following Excel formulas:
    Cumulative distribution function (CDF) of the hypergeometric distribution in Excel
      =IF(k>=expected,1-HYPGEOM.DIST(k-1,s,M,N,TRUE),HYPGEOM.DIST(k,s,M,N,TRUE))
      =IF(k>=((s*M)/N),1-HYPGEOM.DIST(k-1,s,M,N,TRUE),HYPGEOM.DIST(k,s,M,N,TRUE))
    expected = (s*M)/N
    direction =IF(k=expected,"match",IF(k<expected,"de-enriched","enriched"))
    fold change =IF(k<expected,expected/k,k/expected)
    where k is the number of successes (intersection of DAR/DHMR in promoters + DEG), s the sample size (DEG), M the number of successes in the population (DAR/DHMR in promoters) and N the population size (28,602 genes). For each condition, the count of downregulated and upregulated DEG (s) was taken from supplementary table 4. Similarly, the count of downregulated and upregulated DAR/DHMR (M) was taken from supplementary table 10, considering only differential peaks that are annotated as "promoter-TSS" in the annotation column (column M). The population size (N) was the total list of genes that were DEG, DAR or DHMR (combining the data on supplementary tables 4 and 11, eliminating duplicates). Finally, the intersection of DAR and DEG (k) for each condition was retrieved with the following venn diagram online tool: https://bioinformatics.psb.ugent.be/webtools/Venn/"
  * Issue 2: Discrepancies in DEG counts from supplementary table 11
    ▪ Resolved: Investigated variable definitions (the wrong variable, strand, had been used) and confirmed that log2FoldChange determines up/down-regulation.
  * Issue 3: Filling in DAR/DHMR values
    ▪ Unresolved: Unclear correspondence between "promoters" rows and Excel file sheets. Does H3K27me3 correspond to the promoters?
  * Issue 4: Using the Venn diagram tool to find intersections
    ▪ Unresolved: Worked for one condition (ATAC vivo poly (down)) but failed for ATAC vitro-vibrio and ATAC-vivo-vibrio. Tool returns a "Request Entity Too Large" error.
  * Issue 5: Define the population size
    ▪ Unresolved: The instructions for defining the population size are not clear. In supplementary table 4, it seems that the variable "Gene ID (ENSEMBL)" should be used, but in supplementary table 10, should the variable "Nearest PromoterID" or "Gene symbol" be used?
  Using supplementary table 11 values to perform hypergeometric test
  
  Having failed to obtain the values required to reproduce supplementary table 11, the data already provided were used to obtain the "enrichment" and "p-value" values using the Excel function provided.
  * Issue 1: Comparison of p-values
    ▪ Resolved: For the Up condition, extremely small p-values are not displayed correctly due to Excel's limitations in scientific notation. Excel may either display them as zero or in an incomplete scientific format (e.g., 0.00E+00). The web tool was used instead.
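  
  For readers who want to rerun the test outside Excel, the formulas quoted in the authors' clarification can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the authors' or the reviewer's code: the function name `enrichment_test` is hypothetical, the argument order follows HYPGEOM.DIST(k, s, M, N, TRUE), and the handling of k = 0 is a choice made for this sketch. The tail probability is summed directly with exact integer arithmetic rather than computed as 1 - CDF, since that subtraction cannot resolve values below roughly 1e-16 in double precision.

```python
from fractions import Fraction
from math import comb, inf


def enrichment_test(k, s, M, N):
    """Hypergeometric over/under-representation test (illustrative sketch).

    k: overlap count, s: sample size, M: successes in the population,
    N: population size.
    Returns (p_value, direction, fold_change).
    """
    if not (0 < s <= N and 0 < M <= N and 0 <= k <= min(s, M)):
        raise ValueError("need 0 < s <= N, 0 < M <= N and 0 <= k <= min(s, M)")

    expected = Fraction(s * M, N)
    total = comb(N, s)  # number of equally likely samples of size s

    def tail(lo, hi):
        # Exact sum of P(X = i) for lo <= i <= hi.
        ways = sum(comb(M, i) * comb(N - M, s - i) for i in range(lo, hi + 1))
        return Fraction(ways, total)

    if k >= expected:
        p = tail(k, min(s, M))  # P(X >= k), the Excel branch 1 - CDF(k - 1)
    else:
        p = tail(0, k)          # P(X <= k), the Excel branch CDF(k)

    if k == expected:
        direction = "match"
    elif k < expected:
        direction = "de-enriched"
    else:
        direction = "enriched"

    if k < expected:
        # Excel would report a division error for k = 0; this sketch returns infinity.
        fold = inf if k == 0 else float(expected / k)
    else:
        fold = float(k / expected)

    return float(p), direction, fold


# Toy numbers for illustration only (not taken from the study):
p_value, direction, fold_change = enrichment_test(k=2, s=3, M=4, N=10)
```

  Because the p-value is held as an exact fraction until the final conversion, very small values are not rounded to zero by an intermediate subtraction, which is one plausible way to sidestep the display problem described under Issue 1.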
  
  5.4 Statistical comparison Original vs Reproduced results
  ● Results: Based on the available data in supplementary table 11, the "enrichment" and "p-value" values have been successfully reproduced in most cases.
  ● Comments: The full table could not be reproduced, particularly the data corresponding to DAR/DHMR, DAR/DHMR+DEG and population size values, due to missing information or unclear definitions in the supplementary files.
  ● Errors detected: The enrichment value for the Up condition of promoters-vitro-vibrio was incorrectly reported in the manuscript/table. Based on the Excel formula and the online tool used, the correct value appears to be 2.28 instead of 2.82.
  ● Statistical Consistency: All the values that could be reproduced from the available data matched the original results, except for the detected error.
  
  Conclusion
  
  Summary of the computational reproducibility review
  
  The study's results were partially reproduced. Key values such as enrichment and p-values were successfully replicated, but some dataset elements (DAR/DHMR, DAR/DHMR+DEG, and population size) could not be verified due to insufficient methodological details provided in the manuscript. An error in the enrichment value for the Up condition of promoters-vitro-vibrio was identified (2.28 instead of 2.82). The p-values used for statistical inference were, however, successfully reproduced.
  
  Recommendations for authors
  o Improve data documentation: Define variables in supplementary files.
  o Provide all code and scripts: Share the Excel formulas used for table 4/supplementary table 11.
  o Clarify statistical methodology: Include detailed methods description for the hypergeometric test.
  o Enhance reproducibility workflow: Provide a structured README with all necessary steps.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.02.15.580452v2
Jul 2025
www.biorxiv.org www.biorxiv.org

Chevreul: An R Bioconductor Package for Exploratory Analysis of Full-Length Single Cell Sequencing

2
1. GigaScience 30 Jul 2025
  
  in GigaByte
  
  Editors Assessment:
  
  This paper presents Chevreul, a new open-source R Bioconductor (meta-)package for processing and integration of scRNA-seq data from cDNA end-counting, full-length short-read or long-read protocols, alongside an R Shiny app for easy visualization, formatting, and exploratory analysis of scRNA-seq data processed in the SingleCellExperiment Bioconductor or Seurat formats. The name of the tool is inspired by the colour theorist Michel-Eugène Chevreul and the optical illusion of the same name. To demonstrate the use of Chevreul, the authors provide a sample analysis, which shows how users can visualize a wide range of parameters, enabling transparent and reproducible scRNA-seq analyses. Peer review also pushed the authors to provide extensive guidance materials to assist with use. The R package and integrated Shiny application are freely available under an open-source MIT license in Bioconductor and on their GitHub page: https://github.com/cobriniklab/chevreul
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 30 Jul 2025
  
  in GigaByte
  
  Abstract: Chevreul is an open-source R Bioconductor package and interactive R Shiny app for processing and visualization of single cell RNA sequencing (scRNA-seq) data. It differs from other scRNA-seq analysis packages in its ease of use, its capacity to analyze full-length RNA sequencing data for exon coverage and transcript isoform inference, and its support for batch correction. Chevreul enables exploratory analysis of scRNA-seq data using Bioconductor SingleCellExperiment or Seurat objects. Simple processing functions with sensible default settings enable batch integration, quality control filtering, read count normalization and transformation, dimensionality reduction, clustering at a range of resolutions, and cluster marker gene identification. Processed data can be visualized in an interactive R Shiny app with dynamically linked plots. Expression of gene or transcript features can be displayed on PCA, tSNE, and UMAP embeddings, heatmaps, or violin plots while differential expression can be evaluated with several statistical tests without extensive programming. Existing analysis tools do not provide specialized tools for isoform-level analysis or alternative splicing detection. By enabling isoform-level expression analysis for differential expression, dimensionality reduction and batch integration, Chevreul empowers researchers without prior programming experience to analyze full-length scRNA-seq data.
  
  Data availability: A test dataset formatted as a SingleCellExperiment object can be found at https://github.com/cobriniklab/chevreuldata.
  
  Reviewer 1. Dr. Luyi Tian and Dr. Hongke Peng
  
  Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes. Thus, the statement of need is well-defined, addressing both the problem (complexity of scRNA-seq data analysis without programming skills) and the intended audience (non-programming researchers in the field).
  
  Additional Comments: This study provides Chevreul, a Bioconductor package for analysis and visualization of single-cell sequencing data. The package contains a Shiny app. It also provides functions, implemented through a set of Bioconductor packages for standard scRNA-seq analysis, that generate the necessary input for the Shiny app. I believe that this app can provide an additional option for researchers who work with single-cell data. However, a few comments need addressing.
  
  While the title emphasizes "exploratory analysis of full-length single-cell sequencing," the authors do not explicitly mention the analysis of full-length data (e.g., isoform detection or quantification). For instance, the “sce_process(...)” pipeline figure lacks specific steps addressing full-length sequencing workflows. To strengthen this claim, the authors might need to mention or summarize the methods for isoform detection and quantification, for both annotated and novel isoforms. It would be better to specify recommended tools for transcript-level analysis (e.g., transcript assembly or differential isoform usage) that integrate with Chevreul's visualization features. Meanwhile, the manuscript focuses on Smart-seq as the representative full-length method. It might also be helpful to discuss other full-length methods such as ONT nanopore sequencing or PacBio, in terms of data processing, transcript assembly, de novo usage, or potential challenges in adapting Chevreul to these platforms.
  
  There is another minor suggestion. Functions mentioned in the text and Figure 1 (e.g., “sce_process”, “sce_integrate”) should include parentheses (e.g., “sce_process()”) to align with R syntax conventions and clarify their roles as package functions.
  
  Re-review: I am happy with the revision and the authors have fully addressed my concerns.
  
  Reviewer 2. Dr. Tianhang Lv
  
  Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes. Chevreul provides tools for exploratory analysis of single-cell data and offers essential tools for the analysis and visualization of single-cell full-length transcriptomes. In several sections of the article, the authors discuss the key computational challenges addressed by this software. However, in the abstract, they need to emphasize the advantages of Chevreul in single-cell full-length transcript analysis (the current version lacks sufficient description). In the "Statement of Need" section, the authors could also highlight the limitations of existing single-cell full-length transcript analysis tools and introduce the advantages of Chevreul in this regard.
  
  Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?
  
  Yes. Although the authors have provided installation documentation, the current documentation on GitHub is not user-friendly. For example, the page at https://github.com/cobriniklab/chevreul does not include code for importing seuratTools, yet it runs the built-in function clustering_workflow from seuratTools. Additionally, the current documentation is overly simplistic and not accessible to those without programming experience.
  
  Is the documentation provided clear and user friendly?
  
  No. The authors have separated the example workflows for SingleCellExperiment objects and Seurat objects into two different GitHub projects, which is not conducive for users to understand the structure of Chevreul or to facilitate learning. Additionally, the batch integration mentioned in the article lacks specific implementation examples. The authors should at least provide implementation examples for the results mentioned in the manuscript. Furthermore, the current documentation needs further refinement to truly enable individuals without programming expertise to easily analyze single-cell data.
  
  Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?
  
  No. The authors have developed an excellent Shiny app for single-cell visualization, enabling users without programming expertise to freely export visualization results from single-cell analysis. The installation commands provided by the authors on https://github.com/cobriniklab/chevreul do indeed allow for the installation of Chevreul. However, Chevreul involves nearly 300 dependency packages, including sub-libraries developed by the authors (seuratTools, chevreulPlot, chevreuldata, chevreulProcess, chevreulShiny) as dependencies. Relying solely on the installation commands provided by the authors to install all dependency packages may result in some packages (especially large ones) failing to install due to network bandwidth issues, which is not user-friendly for those without programming experience. Additionally, could the numerous dependency packages of Chevreul potentially cause dependency conflicts with existing R environments? Should the authors recommend that users deploy Chevreul in a new R environment? It is recommended that the authors provide a step-by-step installation guide, explaining potential issues and solutions during the installation process based on the dependencies of Chevreul and its sub-libraries. By installing dependency packages step by step, users can gradually complete the installation of Chevreul. The current installation documentation is clearly not user-friendly for non-programmers and does not align with the authors' statement in the manuscript: "It differs from other scRNAseq analysis packages in its ease of installation and use." At present, the installation documentation provided by the authors may not meet the original design intent of Chevreul. Additionally, the authors should specify that Chevreul supports Seurat version V5.
  
  Have any claims of performance been sufficiently tested and compared to other commonly-used packages?
  
  No. The authors could provide specifications for the minimum hardware requirements needed to run Chevreul, such as the number of CPU cores and the amount of memory. Additionally, the authors could offer data on the runtime of Chevreul as the volume of data increases.
  
  Is automated testing used or are there manual steps described so that the functionality of the software can be verified? No.
  
  Additional Comment. The authors have developed an R Shiny app for single-cell exploratory data analysis, which will significantly expand the application scenarios of single-cell data analysis and bring great benefits to a wide range of biology practitioners. The large size of Chevreul's installation package indicates the considerable difficulty in its development, reflecting the immense wisdom and effort the authors have invested in creating this package. Chevreul's advantages in visualization and analysis are evident, and if further developed and refined, it is certain to attract even more users in the future. To ensure that such an excellent package as Chevreul can be easily and quickly adopted by users, several suggestions for improving the documentation and enhancing user-friendliness are provided. We hope the authors can refine the package based on the reviewers' feedback and recommendations.
  
  Re-review: I have carefully reviewed the revised manuscript and am satisfied that all my comments have been adequately addressed. The authors have resolved the software errors reported in the original submission by updating the relevant shiny app modules. They have also enhanced the package documentation to assist users without programming experience in installing and using Chevreul. In the manuscript itself, the authors have provided detailed responses and explanations to each of my points.
  
  Overall, they have addressed all of my comments thoroughly. That said, a few minor issues remain in the manuscript (revised version with tracked changes) that should be corrected to ensure consistency with academic publishing standards and to help readers better learn how to use Chevreul:
  
  1. On line 52, the placeholder “(doi reference for Shayler et al. data to be provided)” appears. Did the authors forget to insert the citation or data link?
  2. On line 96, would it be more appropriate to replace “SingleCellExperiments” with “SingleCellExperiment objects”?
  3. On line 119, please add a space so that “databases[19–21]used” reads “databases [19–21] used.”
  4. For consistency, should the second occurrence of “batchelor” on line 132 be italicized?
  5. The Chevreul link is already cited in the “Availability & Implementation” section and need not be repeated in the Figure 1 legend.
  6. On line 184, the gene symbol “NRL” should be set in italic Latin script.
  7. On the GitHub page (https://github.com/cobriniklab/chevreul), the phrase “A demo with a developing human retina scRNA-seq dataset from Shayler et al. is available here” points to an inaccessible web demo. Restoring this demo in a future update would greatly facilitate experimental biologists in learning and using Chevreul.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.05.27.656486v1
www.biorxiv.org www.biorxiv.org

CellBinDB: A Large-Scale Multimodal Annotated Dataset for Cell Segmentation with Benchmarking of Universal Models

2
1. GigaScience 08 Jul 2025
  
  in GigaScience
  
  In recent years, cell segmentation techniques have played a critical role in the analysis of biological images, especially for quantitative studies. Deep learning-based cell segmentation models have demonstrated remarkable performance in segmenting cell and nucleus boundaries, however, they are typically tailored to specific modalities or require manual tuning of hyperparameters, limiting their generalizability to unseen data. Comprehensive datasets that support both the training of universal models and the evaluation of various segmentation techniques are essential for overcoming these limitations and promoting the development of more versatile cell segmentation solutions. Here, we present CellBinDB, a large-scale multimodal annotated dataset established for these purposes. CellBinDB contains more than 1,000 annotated images, each labeled to identify the boundaries of cells or nuclei, including 4’,6-Diamidino-2-Phenylindole (DAPI), Single-stranded DNA (ssDNA), Hematoxylin and Eosin (H&E), and Multiplex Immunofluorescence (mIF) staining, covering over 30 normal and diseased tissue types from human and mouse samples. Based on CellBinDB, we benchmarked seven state-of-the-art and widely used cell segmentation technologies/methods, and further analyzed the effects of four cell morphology indicators and image gradient on the segmentation results.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf069 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: Shan Raza
  
  The paper presents a multimodal dataset for cell segmentation and benchmarking. The major strength of the dataset is its multimodal nature and its inclusion of both mouse and human tissue. The paper analyses existing datasets and the performance of state-of-the-art methods. However, the authors missed one of the largest datasets for cell segmentation and classification, which includes more than 500,000 annotated nuclei in H&E: https://www.sciencedirect.com/science/article/pii/S1361841523003079.
  
  The CoNIC challenge paper also analyses state-of-the-art nuclei segmentation and classification methods. The authors should add one of the best-performing models to their analysis. I would also suggest that the authors include PQ and FROC among the metrics used to analyse the results, as these are commonly used in this domain for comparison. I would further suggest comparing the results with HoVerNet or HoVerNext (https://github.com/digitalpathologybern/hover_next_train), which are state-of-the-art algorithms for nuclei instance segmentation. The code for these algorithms is publicly available.
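
For readers unfamiliar with PQ (panoptic quality), the following is a minimal sketch of how it is commonly computed for instance segmentation: predicted and ground-truth instances are matched when their IoU exceeds 0.5, and the summed IoU of the matches is divided by TP + 0.5·FP + 0.5·FN. The function name `panoptic_quality` and the toy label maps are illustrative assumptions, not code from the manuscript or from the CoNIC challenge.

```python
from collections import Counter


def panoptic_quality(gt, pred):
    """PQ for two instance label maps of equal shape (0 = background)."""
    gt_area, pred_area, overlap = Counter(), Counter(), Counter()
    for gt_row, pred_row in zip(gt, pred):
        for g, p in zip(gt_row, pred_row):
            if g:
                gt_area[g] += 1
            if p:
                pred_area[p] += 1
            if g and p:
                overlap[(g, p)] += 1

    iou_sum, tp = 0.0, 0
    for (g, p), inter in overlap.items():
        iou = inter / (gt_area[g] + pred_area[p] - inter)
        if iou > 0.5:  # IoU > 0.5 guarantees each instance matches at most once
            iou_sum += iou
            tp += 1

    fp = len(pred_area) - tp  # predicted instances without a match
    fn = len(gt_area) - tp    # ground-truth instances without a match
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0


# Hypothetical toy example: one good match, one missed cell, one spurious cell.
gt = [[1, 1, 0, 0],
      [1, 1, 0, 2],
      [0, 0, 0, 2]]
pred = [[1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 3, 3]]
print(panoptic_quality(gt, pred))  # 0.375
```

In the toy example the single match has IoU 0.75, and the missed and spurious instances each add 0.5 to the denominator, which is why PQ penalises both segmentation quality and detection errors in one number.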
2. GigaScience 08 Jul 2025
  
  in GigaScience
  
  Reviewer: Jeff Rhoades
  
  General comments:
  
  Dataset Innovation: CellBinDB offers a significant improvement over existing datasets with its diversity of staining types (DAPI, ssDNA, H&E, mIF) and broad tissue coverage, including normal and diseased samples.
  
  Benchmarking of Models: The evaluation of seven state-of-the-art segmentation algorithms provides valuable insights for researchers selecting tools for various imaging modalities.
  
  Analysis of Influencing Factors: The manuscript thoroughly examines biological (e.g., cell morphology) and technical (e.g., image gradient) factors affecting model performance, providing practical recommendations for improving segmentation outcomes.
  
  Preprocessing Impact: Demonstrating the effectiveness of preprocessing (e.g., grayscale conversion for H&E images) is an immediately actionable takeaway for practitioners. However, authors should apply preprocessing uniformly to all segmentation approaches, not just those that did poorly initially.
  
  Major Areas for Improvement:
  
  Preprocessing Uniformity:
  
  Apply preprocessing steps uniformly across all segmentation approaches to ensure fair comparisons and avoid bias.
  
  Inclusion of Cellpose3 Training Dataset:
  
  The manuscript should include the dataset used for training Cellpose3 in its comparisons. Cellpose3's superior generalist model performance is emphasized, yet the absence of its training dataset in the comparisons raises questions about robustness of the benchmarking.
  
  Evidence of Dataset Utility:
  
  While the dataset's benchmarking is well-done, the manuscript does not provide evidence that models trained on CellBinDB outperform those trained on other datasets. Addressing this, though potentially out of scope, would strengthen the manuscript's impact.
  
  Figure Panels:
  
  Labeling in figure panels should be clearer to enhance interpretability. For instance, indicate whether the instance or semantic masks are being shown and consider making instance segmentation masks colorful to highlight unique IDs.
  
  Semantic masks could be omitted if space is constrained, as they are largely redundant with instance masks.
  
  Ensure figures are spaced more evenly throughout the text, ideally located near their first references, to improve readability.
  
  Abstract Clarity:
  
  The abstract should better reflect the intellectual contributions of the analysis of segmentation performance factors (i.e. cell morphology and image gradients).
  
  Normalization Methods:
  
  Provide details on how cell morphology indicators are normalized in the methods section to ensure reproducibility and clarity.
  
  Explanation of Image Gradient:
  
  The discussion of gradient magnitude and its calculation using the Sobel operator requires more accessible language. Not all readers will be familiar with this concept, so additional context is essential.
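
As a hedged illustration of the concept, the sketch below computes a Sobel gradient magnitude in plain Python: two 3×3 kernels estimate the horizontal and vertical intensity change at each pixel, and the magnitude is the Euclidean norm of the two. The function name `sobel_magnitude`, the border handling (borders left at zero), and the toy image are assumptions for illustration; the manuscript's actual implementation may differ.

```python
import math

# Sobel kernels: KX responds to horizontal change, KY to vertical change.
KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Gradient magnitude for the interior pixels of a 2D grayscale image."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]  # border pixels stay at 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    v = img[y + dy][x + dx]
                    gx += KX[dy + 1][dx + 1] * v
                    gy += KY[dy + 1][dx + 1] * v
            out[y][x] = math.hypot(gx, gy)  # sqrt(gx^2 + gy^2)
    return out


# Hypothetical toy image with a sharp vertical edge between columns 1 and 2.
img = [[0, 0, 10, 10]] * 4
print(sobel_magnitude(img)[1][1])  # 40.0
```

A high value marks a sharp intensity transition such as a well-defined cell boundary, while a flat region yields zero, so an image-level summary of these magnitudes is one way to express how crisp the boundaries are.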
  
  Tissue Classification:
  
  Group related tissues, such as "brain," "half brain," and "cerebellum," under a common "neural tissue" category for easier interpretation and analysis.
  
  Additional Suggestions:
  
  Address grammatical errors and improve clarity in some sections, such as the benchmarking pipeline description.
  
  Replace vague terms like "ML-based" when referring to CellProfiler with specific algorithmic descriptions.
  
  Including public datasets, such as Cellpose, to create a unified, all-inclusive CellBinDB dataset might significantly enhance the resource's utility for machine learning practitioners.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.11.20.619750v2
www.biorxiv.org www.biorxiv.org

Extraction of biological terms using large language models enhances the usability of metadata in the BioSample database

2
1. GigaScience 08 Jul 2025
  
  in GigaScience
  
  BioSample is a comprehensive repository of experimental sample metadata, playing a crucial role in providing a comprehensive archive and enabling experiment searches regardless of type. However, the difficulty in comprehensively defining the rules for describing metadata and limited user awareness of best practices for metadata have resulted in substantial variability depending on the submitter. This inconsistency poses significant challenges to the findability and reusability of the data. Given the vast scale of BioSample, which hosts over 40 million records, manual curation is impractical. Rule-based automatic ontology mapping methods have been proposed to address this issue, but their effectiveness is limited by the heterogeneity of BioSample metadata. Recently, large language models (LLMs) have gained attention in natural language processing and have been expected as promising tools for automating metadata curation. In this study, we evaluated the performance of LLMs in extracting cell line names from BioSample descriptions using a gold-standard dataset derived from ChIP-Atlas, a secondary database of epigenomics experiment data, which manually curates samples. Our results demonstrated that LLM-assisted methods outperformed traditional approaches, achieving higher accuracy and coverage. We further extended this approach to extraction of information about experimentally manipulated genes from metadata where manual curation had not yet been applied in ChIP-Atlas. This also yielded successful results for the usage of the database, which facilitates more precise filtering of data and prevents misinterpretation caused by inclusion of unintended data. These findings underscore the potential of LLMs to improve the findability and reusability of experimental data in general, significantly reducing user workload and enabling more effective scientific data management.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf070 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: Christopher Tabone
  
  This manuscript evaluates the use of large language models (LLMs) to improve the consistency and usefulness of BioSample metadata. The authors focus on extracting specific biological terms from freetext sample descriptions: first, identifying cell line names (using a curated gold-standard for evaluation), and second, identifying experimentally modulated gene names (in a scenario without prior manual curation). An open-source 70B LLM (Llama 3.1) was used and its performance was compared against a conventional ontology-mapping pipeline (MetaSRA). Overall, the study is well-motivated - addressing the challenge of heterogeneous metadata - and the approach is generally sound and well documented. Below, I address specific aspects of the work in detail: Methodological Appropriateness and Controls: The methods are appropriate to the study's aims and are described with detail. The two-part evaluation (cell line extraction and gene name extraction without prior curation) aligns well with the goal of demonstrating LLM utility in metadata curation. The authors took care to construct a gold-standard dataset for cell line extraction by leveraging ChIP-Atlas's manually curated sample annotations. This approach avoids starting from scratch and ensures the evaluation is grounded in experimental metadata. The sample selection strategy is well justified: using equal numbers of ChIP-seq and ATAC-seq samples to control for the presence/absence of protein names (a potential confounder for detecting cell lines), avoiding duplicate projects and identical terms, and restricting to human samples to leverage the Cellosaurus ontology. These controls strengthen the evaluation by preventing bias (e.g. one project dominating results or trivial cases duplicating answers). The LLM pipeline is clearly outlined (Figure 2) - the model is prompted with BioSample attributes to extract a representative cell line term. 
Importantly, the authors compare this LLM-assisted pipeline against an existing rule-based method (the MetaSRA ontology mapping pipeline). This serves as an essential control/baseline to quantify the improvement gained by using an LLM. For the second task (extracting modulated gene names), where no curated baseline exists, the authors sample thousands of BioSample entries and perform manual evaluation of the LLM's outputs. While manual checking is necessary here, the manuscript could clarify the evaluation procedure (e.g. how many evaluators or what criteria were used) to assure readers of consistency. Overall, the experimental design is solid. The necessary details (model used, prompt design, parameter settings like temperature=0 for reproducibility) are all provided, and the authors have made their code publicly available, which aids reproducibility. The methodology is transparent and should allow others to replicate or build upon the work. Support for Conclusions by Data: The conclusions are, for the most part, well supported by the data presented. In the cell line extraction task, the LLM-based method clearly outperforms the traditional MetaSRA pipeline in both accuracy and coverage (Table 4). For example, the LLM pipeline achieved substantially higher coverage (93.0% vs 72.1% for MetaSRA) without sacrificing accuracy (~92.3% vs 90.3%), and it also showed improved precision in identifying non-cell line samples. These results validate the authors' claim that LLMs can more flexibly and comprehensively interpret metadata, mapping many more actual cell line samples to ontology terms while maintaining low false-positive rates. The data support the conclusion that the LLM approach enhances metadata findability (since far more samples get correctly annotated) and does so with high reliability. 
The authors appropriately note that the conventional method's conservative strategy yields high precision at the cost of leaving many samples unmapped, whereas the LLM can confidently map a greater portion of samples. This finding is well substantiated by the numbers and the error analysis in Table 5 (which categorizes the few failure cases of the LLM, such as confusion with derivative cell lines or missing a cell line when certain keywords were absent). In the gene name extraction task, the authors report that the LLM identified at least one gene in 600 out of 3,723 tested samples, with an overall accuracy of ~80.3% for those outputs (about 91.6% accuracy on gene names themselves, and 84.7% on the associated modulation method). This demonstrates that the LLM can successfully parse complex descriptions to find gene perturbations in a majority of cases. While there is no baseline for direct comparison here, these results are consistent with the idea that LLMs can extend curation to new information types not yet curated (in this case, finding manipulated genes where an ontology or curated list didn't exist). The authors' conclusions about the utility of this - for example, that it could allow users to filter out experiments with gene knockouts/knockdowns to avoid confounding effects - are reasonable extrapolations from the data. The discussion correctly notes that coverage for this gene task wasn't evaluated (since no gold standard exists) and acknowledges that some fraction of relevant cases might be missed. All major conclusions (LLM outperforms rule-based methods; LLM extraction of new metadata is feasible and useful) are backed by the evidence provided. The authors also contextualize their findings by noting limitations and practical considerations (e.g. the processing throughput of ~400 samples/hour and the challenge of scaling to 40 million records). 
This adds credibility to their interpretation that LLM-based curation will need further resources or model improvements to handle the entire database. In summary, the data presented are analyzed in depth (with relevant tables, figures, and a breakdown of error types), and they support the paper's conclusions well. I have no concerns that the authors are overstating their results. Language Clarity and Quality: The manuscript is written in generally clear and professional English. The authors note that they translated the draft from Japanese with assistance from ChatGPT, and the result is readable and scientifically appropriate. The overall clarity is good - important terms are defined, and the narrative flows logically from the motivation to methods, results, and discussion. I did not encounter ambiguities that impede understanding of the science. There are only a few minor issues in language usage and grammar that require attention. For example, there is a small typo in the description of gene overexpression ("achieved by trasfection of a plasmid…" on page 19) - "trasfection" should be "transfection" (unless this typo was carried over from the original prompt). Another example is the sentence "the outcomes of this study can handle these errors to rescue the affected published data for further use," which is a bit awkward in phrasing - perhaps reword to clarify that the methods developed can help correct metadata errors from submitted data. These are relatively minor edits; the manuscript does not require heavy language revision, just light editing for a few misspellings and stylistic "smoothing". The structure of the paper is appropriate, with a clear Introduction and well-labeled sections (Methods, Results/Discussion, Limitations, etc.). Data presentation is also clear: figures and tables are easy to interpret, and captions are explanatory. 
For example, the flowchart in Figure 2 and the definitions in Figure 3 clearly help in the understanding of the pipeline and metrics. In summary, with minor editorial changes, the quality of language and presentation will be suitable for publication. Statistical Analysis and Data Presentation: I am able to assess all the statistics and quantitative analyses in the manuscript, and they appear appropriate. The study primarily uses descriptive performance metrics (accuracy, coverage, precision, recall) to evaluate the extraction tasks - these are standard and well defined (the text and Figure 3 provide clear definitions of each metric in the context of the task). The comparisons between the LLM pipeline and the MetaSRA pipeline are straightforward to interpret. The authors did not perform complex statistical tests (e.g., no p-values are reported), which can be justified given that the magnitude and consistency of the improvements are evident and the evaluation emphasizes practical performance metrics rather than hypothesis testing. However, the manuscript states in Supplementary Table 1 that "no significant differences were observed" between ChIP-seq and ATAC-seq subsets. If the authors intend "significant" to indicate statistical significance, it would be necessary to include the specific statistical test used along with associated test statistics and p-values to substantiate this claim. If no formal statistical testing was conducted, it would be more accurate and clearer to rephrase this as a qualitative observation rather than implying formal statistical support. All underlying data needed to interpret the results are provided either in the main figures/tables or supplementary material. The presentation of results is clear and transparent: Table 4 quantitatively summarizes the performance of each pipeline, and Table 5 qualitatively categorizes the errors made by the LLM. 
I have no other concerns about the appropriateness of statistical methods used - the evaluation metrics are suitable for information extraction tasks, and the sample sizes (600 samples for the cell line task, and thousands scanned for the gene task) are adequate to support the conclusions. In terms of data transparency, the manuscript indicates that outputs and code are available (with a GitHub repository provided), which will allow others to reproduce the analysis. Additional comments and suggestions: Beyond the points above, I have a few minor suggestions to further strengthen the manuscript. First, it would be helpful if the authors could clarify in the Methods how the manual evaluation of gene name extraction was performed, for example, whether multiple curators independently reviewed the outputs or if any consensus procedure was employed to resolve ambiguous cases. Providing this detail would add transparency to the accuracy figures reported, although the existing explanation about handling ambiguous cases (e.g., fusion genes) is already helpful. Second, given the manuscript's emphasis on a zero-shot LLM approach, it would be beneficial for the authors to briefly discuss whether alternative strategies, such as fine-tuning smaller language models, were considered. This would more clearly position the study within the broader landscape of metadata curation techniques. Third, the authors describe the use of the locally deployed Llama 3.1 model and emphasize its advantages regarding data privacy and scalability. Since these benefits are significant for practical adoption, it would further strengthen the manuscript if the authors explicitly highlight practical considerations, such as specific hardware requirements (in addition to the graphics card usage already included) and runtime performance benchmarks. 
Finally, as mentioned earlier, the authors mention in Supplementary Table 1 that "no significant differences were observed" between ChIP-seq and ATAC-seq samples. If the term "significant" here is meant to indicate statistical significance, please include details of the specific statistical test and associated values (e.g., test statistics and p-values) that substantiate this conclusion. If no formal statistical testing was performed, it would be more appropriate to rephrase this statement to indicate a qualitative observation rather than imply statistical testing. These points are relatively minor and do not indicate fundamental issues with the manuscript. Recommendation: In summary, this is a strong manuscript that addresses a pertinent problem in biological data management using modern LLM tools. The methods are sound and well controlled, the results are convincing, and the authors have been appropriately cautious and thorough in their analysis. I recommend minor revisions for this manuscript. The revisions needed are primarily editorial (minor language fixes and clarifications), with one note about statistics, and do not require additional experiments. With those addressed, the work should be suitable for publication in GigaScience.
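
One way the "no significant differences" statement could be backed by a formal test is a Fisher's exact test on the counts of correct and incorrect extractions in the two subsets. The sketch below is a standard-library implementation of the two-sided test; the function name `fisher_exact_two_sided` and the counts in the usage example are hypothetical and are not taken from the manuscript or its Supplementary Table 1.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Small table that can be checked by hand: p = 34/70, about 0.4857.
print(fisher_exact_two_sided(3, 1, 1, 3))

# Hypothetical counts: correct vs. incorrect extractions in each subset.
chip_correct, chip_wrong = 280, 20
atac_correct, atac_wrong = 274, 26
print(fisher_exact_two_sided(chip_correct, chip_wrong, atac_correct, atac_wrong))
```

Reporting the test name and the resulting p-value alongside the per-subset accuracies would make the claim verifiable; if no test is run, the statement is better phrased as a qualitative observation, as the reviewer suggests.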
2. GigaScience 08 Jul 2025
  
  in GigaScience
  
  Reviewer: Sajib Acharjee Dip
  
  1. The gold-standard dataset constructed for evaluation, though carefully validated by experts, was limited to 600 samples (300 ChIP-seq and 300 ATAC-seq). Such a limited scope may introduce selection bias or fail to capture the full variability present across the entire BioSample database (>40 million records). It is unclear how representative these samples are of real-world metadata submissions. Clearly demonstrate the representativeness of the sample selection, or increase the sample size, to better represent BioSample's diversity.
  
  The manuscript predominantly compares the proposed LLM-based approach to the MetaSRA pipeline. While MetaSRA is a relevant baseline, the omission of comparisons with other contemporary methods such as ChIP-GPT and Bioformer is a notable oversight. These tools represent significant advancements in the field and have demonstrated efficacy in tasks closely related to the study's objectives. A comprehensive evaluation against these methods, or a comparative discussion, would provide a clearer understanding of the proposed approach's relative performance and contributions. https://academic.oup.com/bib/article/25/2/bbad535/7600389 https://pmc.ncbi.nlm.nih.gov/articles/PMC10029052/
  
  "LLM-assisted methods outperformed traditional approaches, achieving higher accuracy and coverage." While the study reports improved performance over MetaSRA, the absence of comparisons with other SOTA methods renders this assertion less robust. Without such comparative analyses, it's challenging to attribute the observed improvements solely to the proposed approach. Rephrasing claims to accurately reflect the scope of the comparisons made would strengthen clarity.
  
  Despite high accuracy, complex cases (fusion proteins, inhibitors mentioned indirectly, ambiguous terminology) were recognized as difficult, yet were excluded from primary accuracy evaluations. By excluding these ambiguous cases from performance metrics, the accuracy results might be artificially improved. Provide additional metrics that include these complex or ambiguous cases, clearly quantifying performance drops. This would offer more realistic insights into real-world applicability.
  
  The error categorization provided (derivation issues, overlooked terms, selection failures, etc.) is helpful, but somewhat superficial. The deeper root causes, such as the LLM's lack of biological context knowledge, tokenization errors, or prompt ambiguity, were not thoroughly explored or explained. Discuss or perform deeper qualitative analysis on specific error instances, highlighting precisely why the LLM made incorrect decisions (e.g., lack of biological understanding, misinterpretation of abbreviations, limitations of prompt wording).
  
  Temperature settings were fixed at zero for deterministic outputs. While deterministic settings are valuable for reproducibility, exploring or reporting the effect of temperature variations on accuracy and robustness would have strengthened this methodological choice significantly.
  
  The authors have not sufficiently explored or justified their prompt engineering choices which are critical for reproducibility and optimization. I recommend providing additional experiments or discussions on alternative prompting strategies tested, including prompt variants that failed and reasons why particular prompts were selected.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.02.17.638570v1
www.biorxiv.org www.biorxiv.org

CODARFE: Unlocking the prediction of continuous environmental variables based on microbiome

1
1. GigaScience 08 Jul 2025
  
  in GigaScience
  
  Despite the surge in data acquisition, there is a limited availability of tools capable of effectively analyzing microbiome data that identify correlations between taxonomic compositions and continuous environmental factors. Furthermore, existing tools also do not predict the environmental factors in new samples, underscoring the pressing need for innovative solutions to enhance our understanding of microbiome dynamics and fill the prediction gap. Here, we introduce CODARFE, a novel tool for sparse compositional microbiome-predictors selection and prediction of continuous environmental factors. We tested CODARFE against four state-of-the-art tools in two experiments. First, CODARFE outperformed predictor selection in 21 out of 24 databases in terms of correlation. Second, among all the tools, CODARFE achieved the highest number of previously identified bacteria linked to environmental factors for human data (that is, at least 7% more). We also tested CODARFE in a cross-study, using the same biome but under different external effects (e.g., ginseng field and cattle for arable soil, and HIV and Crohn’s disease for human gut), using a model trained on one dataset to predict environmental factors on another dataset, achieving 11% of mean absolute percentage error. Finally, CODARFE is available in five formats, ranging from a Windows version with a graphical interface to installable source code for Linux servers and an embedded Jupyter notebook available at MGnify - https://github.com/alerpaschoal/CODARFE.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf055), which carries out open, named peer-review. The following review is published under a CC-BY 4.0 license:
  
  Reviewer: Jaak Truu
  
  This manuscript addresses key aspects of microbiome data analysis, particularly in relating continuous variables to microbiome data and utilizing microbiome data to predict variables of interest. The data analysis approach is well-articulated; however, there is a notable omission regarding the derivation of the microbiome datasets. While the sources of these datasets are mentioned, it remains unclear whether the authors processed the initial data to produce the count tables used as input or if these tables were directly adopted from the original publications. Given that the data in the main text are derived from studies based on 16S rDNA sequencing, variations in data processing pipelines between publications could introduce significant variability. Although the manuscript discusses the importance of the sequenced 16S rDNA region and the similarity of the environments from which the samples were obtained, it does not address the impact of the initial data processing pipeline (including taxonomy assignment).
  
  Additionally, the number of samples in each dataset is not provided in the tables.
  
  The manuscript includes a comparison of the proposed method with other tools; however, it omits MaAsLin (Microbiome Multivariable Association with Linear Models), which has been applied far more extensively in microbiome data analysis than the tools included in the current manuscript. Incorporating a comparison with MaAsLin would enhance the comprehensiveness of the evaluation.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.07.18.604052v1
www.biorxiv.org www.biorxiv.org

The FIP 1.0 Data Set: Highly Resolved Annotated Image Time Series of 4,000 Wheat Plots Grown in Six Years

2
1. GigaScience 08 Jul 2025
  
  in GigaScience
  
  Background: Understanding genotype-environment interactions of plants is crucial for crop improvement, yet limited by the scarcity of quality phenotyping data. This data note presents the Field Phenotyping Platform 1.0 data set, a comprehensive resource for winter wheat research that combines imaging, trait, environmental, and genetic data. Findings: We provide time series data for more than 4,000 wheat plots, including aligned high-resolution image sequences totaling more than 153,000 aligned images across six years. Measurement data for eight key wheat traits is included, namely canopy cover values, plant heights, wheat head counts, senescence ratings, heading date, final plant height, grain yield, and protein content. Genetic marker information and environmental data complement the time series. Data quality is demonstrated through heritability analyses and genomic prediction models, achieving accuracies aligned with previous research. Conclusions: This extensive data set offers opportunities for advancing crop modeling and phenotyping techniques, enabling researchers to develop novel approaches for understanding genotype-environment interactions, analyzing growth dynamics, and predicting crop performance. By making this resource publicly available, we aim to accelerate research in climate-adaptive agriculture and foster collaboration between plant science and machine learning communities.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf051), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: Wanneng Yang
  
  The manuscript presents a comprehensive dataset spanning six years, encompassing data from eight key growth stages of wheat, along with corresponding phenotypic data. The construction of such a comprehensive dataset is highly valuable. However, from the perspective of dataset construction itself, quality control and consistency checks require further refinement. Specific issues are as follows:
  
  How is the consistency check of parameters such as canopy cover and plant height at the eight key growth stages ensured? Especially for parameters like phenological stages and senescence assessment, which are determined through visual evaluation and thus susceptible to subjective influences, quality control and consistency check become particularly crucial. It is recommended to supplement relevant content for detailed explanation.
  
  For all images (151,150 out of 158,891 images), the success rate of alignment and within-field detection exceeded 95%. Does this mean that the final RGB sequence image dataset consists of 151,150 images?
  
  Regarding plant height measurement, the text mentions that "TLS (2016, 2017) or UAV (2018 to 2022) was used to measure plant height." Given the potential differences in height measurements obtained from these two methods, how were these differences addressed in the manuscript?
  
  Does this dataset cater to different tasks and include annotated data? If so, it is recommended to specify the concrete annotation methods and data.
  
  If possible, it is recommended to provide a summary table that specifies the different types of data contained in the dataset along with their respective quantities, facilitating readers' comprehensive understanding of the dataset.
  
  What are the potential limitations of this dataset? It is recommended to point them out.
2. GigaScience 08 Jul 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf051), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: Abhishek Gogna
  
  Thank you for the submission. The dataset surely holds value for the plant breeding community, but my major concerns are (1) the availability of genetic data and (2) non-conformity to MIAPPE standards (https://www.miappe.org/). These restrict the value of the otherwise excellent publication. I would welcome a submission addressing these major points. In addition, I have some minor points for specific sections. Please use the strings in quotation marks ("") to locate the specific sections.

  Context
  - Change of equipment: Please indicate how the change of equipment from TLS to drone affects data interoperability.
  - "Figure 2, gray bars": Kindly update Figure 2 to clarify the representation of the gray bars.
  - "Heads were annotated": Does this mean that not all relevant images were annotated? If so, please modify the title to avoid confusion.

  Description of FAIR
  - Please revise this section. Both links listed under "Findable" and "Accessible" are eligible for these tags. Please modify "Interoperability" with reference to the publication listed in the "Re-use Potential."

  Reference measurements
  - "Senescence was": Was this measurement done for all relevant images? Please include this information.
  - "Adjusted genotype means with year calculation": Please add variance decomposition data for traits.

  Compilation as Data set
  - "pure GABI-WHEAT set for the extended set": Please revise this sentence for clarity.

  Heritabilities of intermediate and target traits
  - "y of the public marker": Please revise the sentence for clarity.

  Genomic prediction ability of unseen multi-environment trial
  - Is the CDC data part of the data publication? Please add this information.

  Example 1 to 6
  - Please revise all code for consistency and updated results. Also, include the necessary packages required to run the code.

  Availability of Source code and Requirement
  - Please create connectivity between repositories and add descriptive README files outlining their usage. Additionally, please provide instructions on how individual repositories may be used.

  I appreciate your attention to these points and believe that addressing them will strengthen your manuscript.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.10.04.616624v3
www.biorxiv.org www.biorxiv.org

Chromosome-level genome assembly and methylome profile enables insights for the conservation of endangered loggerhead sea turtles

3
1. GigaScience 08 Jul 2025
  
  in GigaScience
  
  Background: Characterising genetic and epigenetic diversity is crucial for assessing the adaptive potential of populations and species. Slow-reproducing and already threatened species, including endangered sea turtles, are particularly at risk. Those species with temperature-dependent sex determination (TSD) have heightened climate vulnerability, with sea turtle populations facing feminisation and extinction under future climate change. High-quality genomic and epigenomic resources will therefore support conservation efforts for these flagship species with such plastic traits. Findings: We generated a chromosome-level genome assembly for the loggerhead sea turtle (Caretta caretta) from the globally important Cabo Verde rookery. Using Oxford Nanopore Technology (ONT) and Illumina reads followed by homology-guided scaffolding, we achieved a contiguous (N50: 129.7 Mbp) and complete (BUSCO: 97.1%) assembly, with 98.9% of the genome scaffolded into 28 chromosomes and 29,883 annotated genes. We then extracted the ONT-derived methylome and validated it via whole genome bisulfite sequencing of ten loggerheads from the same population. Applying our novel resources, we reconstructed population size fluctuations and matched them with major climatic events and niche availability. We identified microchromosomes as key regions for monitoring genetic diversity and epigenetic flexibility. Isolating 191 TSD-linked genes, we further built the largest network of functional associations and methylation patterns for sea turtles to date. Conclusions: We present a high-quality loggerhead sea turtle genome and methylome from the globally significant East Atlantic population. By leveraging ONT sequencing to create genomic and epigenomic resources simultaneously, we showcase this dual strategy for driving conservation insights into endangered sea turtles.
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf054), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: F Gözde Çilingir
  
  In this study, the authors generated a high-quality chromosome-level genome assembly and methylome for the loggerhead sea turtle (Caretta caretta) using a combination of Oxford Nanopore Technology (ONT) and Illumina sequencing. They also examined population size fluctuations, identified microchromosomes as key areas for monitoring genetic diversity and epigenetic flexibility, and focused on genes linked to temperature-dependent sex determination (TSD), with additional datasets from 10 individuals using whole-genome bisulfite sequencing (WGBS).

  The study consists of three key parts: 1) genome sequencing and assembly, 2) benchmarking ONT methylation calls with WGBS, and 3) epigenetic patterning of TSD-linked genes, which was contextualized for future studies. The first part certainly includes relatively novel genomic resources that will provide valuable tools for conservation and population genomics. It's encouraging to see the use of DNA modification detection via ONT, with a comprehensive analysis of 5mC and 5hmC methylomes alongside genomes, especially for chelonians, a group that is underrepresented among available vertebrate genomes. Benchmarking ONT methylation calls with WGBS is also relevant for the field (though some clarifications on the experimental design are necessary). However, I have several concerns regarding the biological rationale of certain study design choices and the conclusions drawn by the authors regarding the TSD-linked genes' methylation patterns.

  Overall, this study provides valuable genomic resources for loggerhead sea turtles. However, some of the biological assumptions and study design choices regarding the methylation patterning require further clarification and a more robust discussion to ensure that the conclusions drawn can be supported by the data produced.

  Detailed comments to the authors

  ABSTRACT

  The abstract states: "Isolating 191 TSD-linked genes, we further built the largest network of functional associations and methylation patterns for sea turtles to date." Throughout the manuscript, this number changes. Please double-check and ensure consistency in the number of TSD-linked genes reported.

  BACKGROUND

  I suggest using the phrase "a skew toward female-biased sex ratios" instead of "feminisation" throughout the text for a clearer and more neutral description of the biological phenomenon. For example, the third sentence of the second paragraph could be revised as: "As multiple theoretical studies have predicted a significant skew toward female-biased sex ratios and subsequent population collapse by 2100 in response to future climate scenarios."

  METHODS

  Page 5, DNA extraction, sequencing, and quality control, first paragraph: ONT kit chemistry numbers and flow cell types can be confusing for readers. Could you also clarify that the SQK-LSK109 kit used is associated with R9.4.1 flow cells, indicating the sequencing error profile of the technology?

  Regarding the Phred score >Q8 cutoff: Q8 corresponds to a sequencing error rate of ~15-16%. Could you clarify the reasoning behind choosing this cutoff? Citing similar studies that have used this threshold would add support to your decision.

  Page 8: I couldn't find the de novo assembled transcriptomes in the ENA or GigaDB repositories. Are these data publicly available? If so, it would be beneficial to provide the location.

  Page 9, ONT methylation call and validation with WGBS: There's a discrepancy between the retained CpGs: you mention "26,449,075 CpGs" in one place and later report different numbers in the results section. Please clarify these numbers and ensure consistency. It would be helpful to include a table summarizing key metrics of the ONT methylation call, such as mean/median CpG site coverage, similar to Table S3.

  Page 9, second paragraph: You mention "Ten nesting loggerheads." Please specify that these are ten adult loggerhead females for clarity. Additionally, correct the table references: Table S3 should be Table S2, Table S4 should be Table S3, etc.

  RESULTS AND DISCUSSION

  Genome Assembly, Figure 1B: While Table 1 effectively illustrates the differences in contiguity levels, Figure 1B doesn't add much due to the difficulty in distinguishing closely aligned lines. If you retain the figure, I suggest using more contrastive colors to improve readability.

  Genome Annotation: I agree that the lack of a pre-determined training parameter set for chelonians within the BRAKER pipeline leads to relatively incomplete gene model predictions. However, lifting over gene models from other sea turtle genomes and combining them with predictions (again using TSEBRA) would likely improve the overall completeness of the annotations.

  Methylation Call and Validation: You state, "To verify our ONT methylation call, we compared calls with ten loggerhead methylomes re-sequenced via WGBS." Does this mean you generated an ONT methylome from a single individual and compared it to the average methylation levels from ten different individuals obtained with WGBS? If so, this may not be an ideal benchmarking strategy. Generating both ONT and WGBS data for all individuals would provide a more robust comparison. Clarifying this design would help the reader understand the validation process better. Additionally, consider citing relevant benchmarking studies.

  In the last paragraph of this section, you highlight ONT as a robust alternative to WGBS but then use WGBS for the TSD-linked gene analysis. This appears somewhat contradictory. It might be useful to explain why WGBS was favored in this part of the analysis.

  Genome Properties: Figures 3C-F were difficult for me to read (low resolution), and they don't seem directly related to Figures 3A and 3B. I suggest separating these figure groups for better clarity. Additionally, it would be helpful to report or visualize the repeat content of both micro and macro chromosomes. Long-read sequencing assemblies are particularly effective at resolving repeat-rich regions, and microchromosomes are often repeat-rich. Highlighting this aspect would demonstrate the added value of long-read sequencing for assembling reference genomes of organisms like sea turtles.

  TSD-linked genes, methylation patterns: Testing methylation differences between TSD-linked and non-TSD-linked genes focusing on specific regulatory regions is potentially informative, but the biological rationale for expecting consistent differences between these two groups is unclear. TSD-linked genes are involved in dynamic, environmentally responsive processes, whereas non-TSD-linked single-copy orthologues (as used in the study) typically represent essential, evolutionarily conserved functions with more stable methylation patterns. The use of single-copy orthologues as a control set is problematic because these genes could serve fundamentally different roles. A more relevant comparison would be between TSD-linked genes and other genes involved in similarly dynamic, environmentally responsive pathways.

  Additionally, all methylation data come from adult female blood (N=10, all from the same beach), which may not be the most appropriate approach for studying TSD, a process that primarily occurs during embryonic development, when temperature cues influence sex determination. Methylation patterns in adults may no longer reflect the active regulatory processes that control TSD during embryogenesis. In other words, adult methylation patterns could be influenced by factors such as reproductive status or aging, and may not reflect the regulation of TSD-linked genes during key developmental stages. These limitations/points should be addressed.

  CONCLUSIONS

  The manuscript would benefit from a discussion of how biological context (such as developmental stage) affects the interpretation of methylation patterns in this study.

  It is also worth mentioning that both ONT and WGBS require substantial amounts of input DNA, and blood samples from reptiles are ideal because of their nucleated red blood cells; this could be acknowledged as a practical advantage somewhere in the text.

  SUPPLEMENTARY INFO

  Could you explain what "DMS" refers to in Text S3? This term isn't defined in the manuscript.

  There are two Figure S7; please change the last one to Figure S8.

  SUPPORTING DATA

  The FTP server data look good, but I couldn't find the de novo transcriptomes. Some files have long, confusing names; adding a README file in each directory would help clarify the contents.

  Important note: It would be helpful to include line numbers in the manuscript to facilitate direct and effective feedback.
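
  The Q8 figure quoted in the review above follows from the standard Phred definition, Q = -10 * log10(P), so P = 10^(-Q/10). A minimal sketch of that arithmetic is given below; the function name is hypothetical and is not taken from the manuscript or from any ONT tool.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Print the error probability for the Q8 cutoff and a few common reference points.
for q in (8, 10, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability = {p:.4f} ({p * 100:.2f}%)")
```

  For Q8 this gives roughly 0.158, i.e. the ~15-16% error rate the reviewer mentions.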
2. GigaScience 08 Jul 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf054), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: Victor Quesada
  
  This work offers an improved version of the reference genome for the loggerhead sea turtle. The authors have also analyzed the methylation patterns of blood obtained from different individuals and with two methods. The resulting data set includes gene annotations, methylation levels and the specific analysis of methylation levels of genes involved in temperature-dependent sex determination (TSD). While the improvements offered by this work seem modest, I think that the data sets may provide important resources for future works.

  - In my opinion, the use of a previous version of the same genome in the assembly process should be noted in the abstract. It would be enough to write "... followed by homology-guided scaffolding to GSC_CCare_1.0...".

  - If possible, the authors should clarify the taxonomic relationship between the reference individual in this work and the reference individual for the previous version of the genome (ref. 26). Is it the same NCBI taxid?

  - There is a mention of "lateral terminal repeats" in the "Genome annotation" section (page 7). I think it is a typo and it should read "long terminal repeats".

  - In the same section, at page 9, reference 73 refers to StringTie, not gffread. In addition, it is not clear how "in-frame stop codons were removed". A simple way to unambiguously explain this would be to provide the options that were used, as with other programs.

  - I would revise the use of "coverage" versus "depth". For instance, the expression "...a coverage of 9.2(...)X" would be more precise as "...a sequencing depth of 9.2(...)X". Coverage should be a fraction or a percentage. However, this is only a piece of advice, as there is no strong consensus at the moment.

  - The interpretation of methylation patterns is always difficult. In my opinion, the manuscript should discuss several limitations about the results. First, using blood as the starting tissue is convenient but not ideal, as many methylation patterns are tissue-specific. The authors may want to add a reference to preliminary evidence that some methylation changes in blood cells are related to TSD (Bock et al., Mol Ecol. 2022; 31:5487-5505). Second, the work examines broad patterns of methylation (all promoters, all coding sequences,...). While this may be interesting for descriptive purposes, it may also drown significant signals. The manuscript should mention this limitation.

  - Figure 2B shows methylation per gene. If the aim is to compare both kinds of sequencing, there should be at least one comparison of methylation per CpG, which might even be categorical or downsampled.

  - The origin of the duplication of EP300 seems outside the scope of the manuscript. Nevertheless, given that the question is posed, the authors may want to perform a simple phylogenetic analysis of the sequences. Even the basic analysis of the annotated copies plus an outgroup is likely to give a robust answer to this question.

  - For the benefit of non-specialists, the manuscript might include a brief mention of how microchromosomes allow a larger number of combinations of variants without chromosome recombination.

  - Some expressions may be edited for clarity and precision. Examples are "which should be verified whether they are true" (page 17) and "microchromosomes have greater methylation potential and realised levels...".
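
  The depth-versus-coverage distinction raised in the review above can be illustrated with a small sketch. The per-base depth values are invented for illustration only, and the function names are hypothetical; they do not come from the manuscript or from any specific tool.

```python
from typing import Sequence


def mean_depth(per_base_depth: Sequence[int]) -> float:
    """Average number of reads covering each reference position (reported as 'X')."""
    return sum(per_base_depth) / len(per_base_depth)


def breadth_of_coverage(per_base_depth: Sequence[int], min_depth: int = 1) -> float:
    """Fraction of reference positions covered by at least min_depth reads."""
    covered = sum(1 for d in per_base_depth if d >= min_depth)
    return covered / len(per_base_depth)


# Toy per-base depths for a 10 bp reference (illustrative values only).
depths = [0, 0, 5, 10, 12, 9, 0, 10, 14, 20]

print(f"Mean sequencing depth: {mean_depth(depths):.1f}X")
print(f"Breadth of coverage:   {breadth_of_coverage(depths):.0%}")
```

  In this toy case the depth is a multiplier (8.0X) while the coverage is a fraction (70% of positions), which is the usage the reviewer recommends.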
3. GigaScience 08 Jul 2025
  
  in GigaScience
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf054), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: Zhongduo Wang

  The study presents high-quality genomic and methylomic data for loggerhead sea turtles, serving as a significant resource for further genomic and epigenomic research on this species. Notably, this is the first methylome derived from a sea turtle using ONT technology, offering a new, reliable method for studying the epigenetic characteristics of non-model organisms. Moreover, by integrating genomic and methylomic data, the authors analyze the functionality and methylation patterns of TSD-related genes, contributing fresh perspectives to the molecular mechanisms underlying TSD. While the study offers valuable data, there are several areas that could be enhanced.

  1) Lack of Reference to Hawksbill Turtle Genome: The manuscript does not discuss any information regarding the hawksbill turtle genome. Given that the hawksbill genome study also published a comparative analysis of the loggerhead's genomic data, I recommend that the authors include relevant information or clarify why hawksbill data was not considered.

  2) Further Optimization of Genome Annotation: The authors acknowledge that the completeness of the genome annotation requires enhancement and mention future improvements such as species-specific parameter adjustments and manual curation. While it is understandable that time and resource constraints may have limited these optimizations prior to submission, it would be beneficial for the authors to clarify the reasons for this and outline a timeline for future enhancements.

  3) Information on Individual Variability in WGBS Results: The manuscript lacks specific information on inter-individual variability among the ten individuals in the WGBS data. I suggest that the authors consider adding this analysis or provide justification for its absence. If significant variability exists among individuals, averaging the methylomic data could obscure important biological information.

  4) Clarification on Statistical Tests and Data Processing: The manuscript employs several statistical tests such as t-tests, F-tests, and chi-squared tests. However, the methods section lacks detailed information on how the data was processed for these analyses. I recommend that the authors provide a more thorough explanation of the data preparation steps, assumptions checked, and justification for the choice of tests.

  In summary, this manuscript makes a significant contribution to the study of loggerhead turtle genomics and methylomics. Addressing the aforementioned points could further enhance the quality and impact of the work.
Visit annotations in context

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.08.28.610089v1
www.biorxiv.org www.biorxiv.org

Chromosome-level reference genome for the medically important Arabian horned viper (Cerastes gasperettii)

3
1. GigaScience 08 Jul 2025
  
  in GigaScience
  
  Venoms have traditionally been studied from a proteomic and/or transcriptomic perspective, often overlooking the true genetic complexity underlying venom production. The recent surge in genome-based venom research (sometimes called “venomics”) has proven to be instrumental in deepening our molecular understanding of venom evolution, particularly through the identification and mapping of toxin-coding loci across the broader chromosomal architecture. Although venomous snakes are a model system in venom research, the number of high-quality reference genomes in the group remains limited. In this study, we present a chromosome-resolution reference genome for the Arabian horned viper (Cerastes gasperettii), a venomous snake native to the Arabian Peninsula. Our highly-contiguous genome allowed us to explore macrochromosomal rearrangements within the Viperidae family, as well as across squamates. We identified the main highly-expressed toxin genes compousing the venom’s core, in line with our proteomic results. We also compared microsyntenic changes in the main toxin gene clusters with those of other venomous snake species, highlighting the pivotal role of gene duplication and loss in the emergence and diversification of Snake Venom Metalloproteinases (SVMPs) and Snake Venom Serine Proteases (SVSPs) for Cerastes gasperettii. Using Illumina short-read sequencing data, we reconstructed the demographic history and genome-wide diversity of the species, revealing how historical aridity likely drove population expansions. Finally, this study highlights the importance of using long-read sequencing as well as chromosome-level reference genomes to disentangle the origin and diversification of toxin gene families in venomous species.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf030 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: Hardip Patel
  
  Dear Authors, thank you for compiling this resource and the manuscript. I apologise for the delay in my review. I have read the manuscript with great interest. I have some major concerns that need to be addressed and a lot of minor concerns. Without line numbers, it was difficult to provide comments. For each comment, I have quoted the part of the sentence it refers to, for you to consider for improvement.
  
  Major concerns:
  
  Abstract can include quantitative values for some key results such as the genome size, contiguity (e.g. N50, L90) and quality metrics (e.g. BUSCO) of the genome assembly among other result claims listed in the abstract.

  Venom as the keyword can perhaps be described/defined. Authors interchangeably use "venom", "toxin", "venom toxin", genes coding venom proteins. I strongly suggest the use of consistent terminologies that are well defined in the manuscript.

  Methods need elaborate descriptions about reagents, procedures including for library preparations, sequencing machines, library kits and versions, etc. These are relevant for downstream analyses.

  For all software, list the parameters used; where defaults were used, explicitly state that "default parameters were used".

  For all software, list version numbers used for analyses.

  Authors are urged to change "macrosynteny" and "microsynteny" terms to chromosome level and local synteny analyses. This is to avoid confusion related to macro/microchromosomes.

  "Genomic diversity" analyses use cross-species alignments and variant calling using software and methods developed for same species data. This can introduce significant bias in downstream interpretation and use of the variant data (heterozygosity measures, perhaps). I suggest removal of this section because of lack of accuracy.

  Discussion of new discovery is largely lacking. I would appreciate if authors contextualized their results with other discoveries in the field.

  Section headings in Results and Discussions can be changed to reflect main findings instead of "transcriptomics" or "genomic diversity".

  One of the main findings is about SVMP gene family expansion. However, due to the lack of evidence about assembly accuracy in the region, accurate annotation of copies, and the effect of studying "primary assembly" instead of "haplotype assembly" at this region, I am not convinced of claims made in the paper. Appropriate justification is required for this section.

  The nomenclature of SVMP genes is confusing. For example, in Figure 4A, they are all labelled as SVMPs with different colours, but then they are labelled as MDCs and MADs in Figure 4b and Supp Figure 6. Please label each gene in each species with consistent names that can reflect orthologous relationship. This is hard to discern, especially without appropriate species labels in Supp Figure 6.

  Provide MSA files and trees used to infer evolutionary history. In the absence of the sequence alignments and raw tree file, I am unable to evaluate this section of the manuscript. Please provide all required details for reviewers and readers.

  ??: It is not clear what authors mean by the word, term, phrase. Please correct them to convey accurate meaning using established and accepted scientific terminologies and English conventions.

  Minor concerns:
  
  Abstract:
  
  "compousing" ??
  "highly expressed toxin genes": in what tissues?
  "genome-wide diversity" ??
  "toxin gene families in venomous species" -> "toxin gene families in venomous snake species"
  
  Background:
  
  "Such advances in sequencing technologies": remove "Such"
  "depending on their type, interactions, and the organism": interactions with what?
  "proteomic (and transcriptomic) approaches": remove parentheses
  "to new therapies for human illnesses including but not": since the title contains "medically important", it would be great to include some specific examples here from the literature.
  "However, venomous snakes are one": remove "However"
  "therefore, the fundamental model system": change "fundamental" to "useful"
  "of medical importance by the World Health Organization (WHO) due to their": provide citation
  "Within venomous snakes, the most medically": restructure the sentence for brevity and clarity.
  "cytotoxic effects (among others)": remove "(among others)"
  "conducted using a proteomic approach": clarify what proteomic approach means here.
  "Hirst et al., (in review);": remove this citation
  "within the Viperidae family posses an available reference": change the word "posses" to something meaningful
  "Moreover, employing several -omics techniques": be specific about techniques
  "We deciphered numerous genomic attributes": be specific
  
  Methods:
  
  Describe how blood was extracted from animals with all details, including animal handling techniques, body part etc.
  "was stored in RNAlater until RNA extraction": source for RNAlater
  "We extracted gDNA from the blood of a female individual": provide additional details such as the quantity of blood used, thawing process, quantity of reagents, especially elution buffer etc. Manufacturer protocols may be best suited to mammalian blood (humans, mice), whose RBCs lack a nucleus, unlike those of snakes.
  "Then, we sequenced a total of two 8M SMRT HiFi cells, aiming for a ∼30x of coverage, at the University of Leiden": provide details of library preparation, sequencing machine etc.
  "(including venom glands, tongue, liver and pancreas, among others": Either list all or refer to the table.
  "RNA libraries were prepared with the VAHTS": Were the library and sequencing strand specific? Provide complete details on these processes.
  "8M SMRT HiFi cell containing two Iso-seq HiFi libraries": use correct names of these and also include sequencing machine details.
  "Quality control on HiFi and Illumina reads was assessed using FastQC": correct the phrasing of this sentence
  "To make an initial exploration of the genome, …..we generated a k-mer profile with Meryl": Explicitly state the purpose of this analysis.
  "Manual curation was performed with Pretext": cite Pretext properly. Explain the decisions of this manual curation, i.e. what evidence was used to join or break contigs.
  "Then, we ran three iterative rounds of RepeatMasker to annotate the known and unknown elements identified by RepeatModeler and soft-masked the genome for simple repeats": break this sentence into two and explain the reasons for running RepeatMasker three times.
  "We used GeMoMa v.1.9": Include all details about the annotations. This sentence is not sufficient for reproducibility. Were the RNAseq data assembled or provided as raw files to GeMoMa? How were they mapped to the genome assembly f
  "published: Anolis carolinensis from Alföldi": Remove the word "from" here as the citation is sufficient. Provide details of assembly versions, annotation version, database of annotations etc.
  "Crotalus ruber from Hirst et al., (in review)": remove this citation or list it as personal communication
  "We previously quality checked and removed the adapters of the RNA-seq data": remove "previously" and provide details on how adapters were removed from RNAseq data
  "also removed the adapters for the Iso-seq data": Explain how this was performed.
  "We blast our ..": Change all occurrences of "blast" to "BLAST" and specify parameters, and whether it was BLASTN or BLASTP or something else. This is not clear at all.
  "we performed additional annotation steps for venom genes.": Details are not complete for reproducibility. State explicitly what decisions were made and how gene structure was determined. This is the main part of the paper and does require accurate details.
  "Whole-genome synteny was explored between": synteny by definition refers to being on the same string/chromosome. Therefore whole-genome synteny as a term doesn't make sense given that the genome is divided into chromosomes. Revise it to say "chromosomal synteny"
  "chromosomes assembled in the reverse complement, which were corrected using SAMtools faidx": samtools faidx cannot do this. Explain how this was done.
  "After adapter trimming and quality control, we mapped our RNA-seq reads": how were adapters trimmed and QC implemented?
  "Gene counts per gene": change gene counts to read counts
  "Differential expression analyses were carried out": requires additional details such as filters applied for the count, groups compared, statistical model, multiple testing correction methods.
  "characterize the venom arsenal of Cerastes gasperettii": change the word "arsenal".
  "Fragmentation spectra were matched against a customized database including the bony vertebrates taxonomy dataset of the NCBI non-redundant database": revise for accuracy
  "Unmatched MS/MS spectra were de novo sequenced": spectra were sequenced how??
  "we used blast, incorporating both toxin and non-toxin paralogs": change blast to BLAST and provide additional details about the tool used
  "Then, we aligned those regions using Mafft (Katoh": provide coordinates of these regions in each assembly for future research
  "history for the main groups of toxins (i.e.,": parenthesis is not closed. Close it or remove it.
  "we also included other non-toxin paralogous genes from nontoxic species (for details about this see Supplementary Information": where do I look in the supplementary information? Be very clear. Provide coordinates of the regions that were compared.
  "When needed, we translated CDS": when was this needed? Explain.
  "built a phylogeny for each of the toxin groups using Phyml": I presume that this was done with translated CDS sequences in toxin genomic regions. Please clarify.
  "Heterozygous positions were obtained from bam files with Samtools v1.9": provide details as to how this was done. Samtools doesn't have features to operate at a site level and therefore I am confused.
  "Filtered reads were mapped against the new reference genome of Cerastes gasperettii using the bwa mem algorithm": bwa mem is designed for same-species comparisons. Here you have used it for cross-species mapping. Provide justification and the biases it may have introduced for distantly related species.
  "SNP calling was carried out …": This is not appropriate as models assume same-species data. You have used cross-species alignments, which can be highly biased.
  
  Results and Discussion:
  
  "PacBio HiFi (~40x), Hi-C (~60x) and Illumina data (~78x)": change to number of base pairs. 40x for a genome of 2GB is 80GB of data, and for a genome of 1GB it is 40GB of data. Before sequencing and assembly, the genome size cannot be known.
  "After manual curation, we enhanced the scaffolding parameters of our genome": what was done as manual curation? Please specify.
  "∼228 times more contiguous than the Anolis sagrei genome": how is "228 times more" measured? How is this useful as a metric without a known ground truth? Assemblies can and do have errors.
  "27,158 different protein-coding genes within our assembly": this seems large compared to other species. Can you elaborate or compare these numbers with other species?
  "Toxin genes usually found in venomous snakes (see proteome results below) were mainly found on macrochromosomes, although major toxin groups were found on microchromosomes (SVMPs, SVSPs and PLA2; Fig. 1).": please revise this statement. The two parts of the sentence say opposite things. Second, provide the coordinates of these genes, with their exon structure annotations, as a supplementary GFF/BED file for others to reuse this information.
  "showed a great level of similarity between Cerastes gasperettii and Crotalus adamanteus": provide quantitative metrics for "great" level of similarity.
  "we found several fission events in the A. sagrei genome,": Since the A. sagrei genome is not contiguous and chromosome scale, you cannot infer fissions, as they may be artefacts of a non-contiguous assembly. If that is not the case, provide evidence of this.
  "The last four…": Belongs in methods
  "Macrosyntenic differences between lizards and snakes": this is a very superficial discussion point. Please remove it or strengthen it with evidence.
  "Heatmap analyses with the most 2,000": Revise this statement. It doesn't make sense. E.g. a heatmap is a visualisation technique and not an analysis method.
  "We studied venom evolution within the most abundant toxin groups": rewrite the sentence for clarity and brevity.
  "After a thorough manual curation": Explain clearly what this manual curation process was and its purpose.
  "contiguous tandem repeat SVMPs for": Change "repeat" to "array" because tandem repeat has a different meaning in the genomics research context.
  "flanked by the NEFL and NEFM": Unclear if they are both 5' or 3' of toxin genes. Clarify.
  "Microsyntenic analyses showed": change to local synteny
  "gene copy number variation between": Since these are duplicate copies, clearly state how gene copies were identified. Include details of open reading frames, exon structures, pseudogene status, etc.
  "we can see an expansion in": Describe the number of new copies, their status as intact or not, and the sequence similarity between copies. Provide evidence that there is no false duplication due to heterozygous allele collapse in the assembly.
  "More genomic data will indicate if SVMP12": Did you mean SVMP13?
  "This difference may be expected, as PLA2 only represents around 5% of the proteome for Cerastes gasperettii": This is not true. The proteome doesn't equal the genome in some cases, and superficial inference such as this is not warranted.
  For PSMC analyses, please discuss the effect of mutation rate and generation time.
  
  Figures:
  
  Figure 1: Add y-axis scales to the circos plot.
  Figure 1b legend says it is a linkage map, but it looks more like a Hi-C contact map. Please edit.
  Figure 1b legend also says "including the sex chromosomes", which is not consistent with the circos plot.
  Figure 3A refers to the transcriptome and 3b to the proteome. Please make this very clear.
  Figure 4A, C and E: label genes consistently with the phylogenetic trees in the supplementary figures so readers can know their genomic arrangements.
  Figure S4: Discuss why the CG1 sample separates from the rest of the samples. Seems like a batch effect.
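  The reviewer's arithmetic on fold coverage versus sequencing yield (40x of a 2 Gb genome is 80 Gb of data; 40x of a 1 Gb genome is 40 Gb) can be sketched as follows. This is only an illustration of that point, not anything taken from the manuscript; the function names and the two genome sizes are assumptions chosen to mirror the reviewer's example, and "Gb" here means gigabases.

```python
GIGABASE = 1_000_000_000  # bases per Gb (gigabase)


def expected_yield_bp(genome_size_bp: int, fold_coverage: float) -> float:
    """Total sequenced bases implied by a fold-coverage figure."""
    return genome_size_bp * fold_coverage


def fold_coverage(total_bases_bp: float, genome_size_bp: int) -> float:
    """Fold coverage implied by a sequencing yield, given a genome size."""
    return total_bases_bp / genome_size_bp


if __name__ == "__main__":
    # The same "40x" label corresponds to different yields for different
    # (hypothetical) genome sizes, which is why yield in bases is the
    # less ambiguous quantity to report.
    for size_gb in (1, 2):
        yield_gb = expected_yield_bp(size_gb * GIGABASE, 40) / GIGABASE
        print(f"{size_gb} Gb genome at 40x -> {yield_gb:.0f} Gb of data")
```

  Reporting the yield in bases lets a reader recompute coverage against whatever genome size estimate they prefer, for example the final assembly length or a k-mer based estimate.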
2. GigaScience 08 Jul 2025
  
  in GigaScience
  
  Venoms have traditionally been studied from a proteomic and/or transcriptomic perspective, often overlooking the true genetic complexity underlying venom production. The recent surge in genome-based venom research (sometimes called “venomics”) has proven to be instrumental in deepening our molecular understanding of venom evolution, particularly through the identification and mapping of toxin-coding loci across the broader chromosomal architecture. Although venomous snakes are a model system in venom research, the number of high-quality reference genomes in the group remains limited. In this study, we present a chromosome-resolution reference genome for the Arabian horned viper (Cerastes gasperettii), a venomous snake native to the Arabian Peninsula. Our highly-contiguous genome allowed us to explore macrochromosomal rearrangements within the Viperidae family, as well as across squamates. We identified the main highly-expressed toxin genes compousing the venom’s core, in line with our proteomic results. We also compared microsyntenic changes in the main toxin gene clusters with those of other venomous snake species, highlighting the pivotal role of gene duplication and loss in the emergence and diversification of Snake Venom Metalloproteinases (SVMPs) and Snake Venom Serine Proteases (SVSPs) for Cerastes gasperettii. Using Illumina short-read sequencing data, we reconstructed the demographic history and genome-wide diversity of the species, revealing how historical aridity likely drove population expansions. Finally, this study highlights the importance of using long-read sequencing as well as chromosome-level reference genomes to disentangle the origin and diversification of toxin gene families in venomous species.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf030 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: Blair Perry
  
  Mochales-Riano et al. present a high-quality genome assembly for the Arabian horned viper and provide a suite of genomic analyses related to synteny, toxin gene evolution and expression, genomic diversity, and demographic history of this and related species. This species is a valuable addition to existing snake genome resources given its medical significance and the current underrepresentation of genomes for Viperidae. I also appreciate that the authors sequenced the heterogametic sex and successfully assembled both sex chromosomes. I do have a few questions and concerns about the manuscript in its current form that I highlight below. Most notably, I find the arguments throughout the manuscript about toxin gene copy number correlating with proteomic abundance to be poorly supported and generally problematic given the data and analyses that the authors present. I suggest that the authors reevaluate these claims, and either provide additional analyses in an effort to support them or otherwise remove them from the manuscript, as I don't think they are ultimately crucial to the value of this genome report.
  
  Introduction:
  
  I find the argument being made in the sentence beginning "Previous works have shown that changes in gene regulation" a bit confusing. Rather than this arguing that studying the expression of venom genes is "insufficient," I think that this instead argues that transcriptomic and proteomic data are critical for studying venom in conjunction with annotated genome sequence. You could for example have a species with 20 copies in a particular tandem array, but only two of them are ever expressed at biologically meaningful levels and thus contribute proteins to the excreted venom. Knowing both the total number of copies in the genome and the number that are actually contributing to the venom proteome are both valuable and necessary for understanding the evolution of that gene family, its role and significance in venom phenotypes, etc. I'm also not sure I follow the logic of the next sentence. Why exactly would the identification of specifically "unexpressed" toxin genes be particularly notable for antivenom, drug discovery, therapeutics, etc.? "We deciphered numerous genomic attributes of this species including its genetic diversity and failed to find evidence of inbreeding" - lack of inbreeding is never discussed in the context of the heterozygosity results, but is pitched here as a major result of the paper. Did the authors have a priori expectations regarding inbreeding in this species?
  
  Methods:
  
  "Gene counts per gene…" - should this be "Gene expression counts per gene…"? Venom gland RNA-seq data was generated from three animals, but proteomic data was generated from a pool of two other animals. This is not ideal for linking gene expression to venom proteome composition, where you really would want venom collected from the same animals you are getting venom gland RNA from. This is especially true if there is intraspecific variation in venom phenotypes within this species. The latitude and longitude are not provided for the two proteome samples. Were these collected from the same latitude and longitude as the RNA-seq animals? For analyses of heterozygosity, the authors map WGS data from diverse species against the Cerastes reference and call variants. Why was this approach chosen over mapping the data for each species to either that species' own reference (i.e., C. viridis and N. naja) or, for those without a reference, to a more closely related species? Presumably that would reduce the potential influence of reference bias on these estimates of heterozygosity?
  
  Results:
  
  "Toxin genes usually found in venomous snakes (see proteome results below) were mainly found on macrochromosomes, although major toxin groups were found on microchromosomes (SVMPs, SVSPs and PLA2; Fig. 1)" this feels a bit contradictory. Maybe just can state that toxin genes were found on both macro and microchromosomes? "Finally, we also found a battery of 3FTxs and myotoxin-like genes, but they were not represented in our RNA-seq dataset (see below)." The authors do not further discuss this result as implied by "(see below)," unless that was simply referring to subsequent discussion of RNA-seq data. From what I can tell, these are also not present in the proteomic data, correct? "The venom gland transcriptome contained a total of 7,237 genes expressed (TPM > 500), including a total of 65 putative toxin genes. Differential gene expression analyses revealed a total of 161 genes (33 putative toxin genes) that were differentially upregulated (FC > 2 and 1% FDR) in venom glands compared to other tissues (Fig. 3A)." Figure 3A only shows 10 toxin genes with "unique" expression in the venom gland, not the 161 upregulated toxin genes as implied here. The authors should add a heatmap with these 161 genes to the supplement, if not to Figure 3 (guessing it might not fit). Fig 3: The authors do not discuss the lack of unique/upregulated expression evidence for PLA2s and Disintegrins in Fig 3A, despite their contribution to protein composition in Fig 3B. For disintegrins in particular, they represent a higher proportion of the venom proteome than CTLs and CRISPs, yet there is no evidence presented for high expression in these genes. What do the authors think is going on here? Could this be a technical issue related to the processing of the RNAseq data, perhaps related to the small size of these genes? Alternatively, could this be indicative of a mismatch between venom phenotypes of the animals used to generate transcriptomic versus proteomic data? 
In the text, the authors state "These genes, together with other SVMPs, SVSPs, Disintegrins (DISI) and Ctype lectins (CTL), were highly expressed in the venom gland and form the core toxic effector components of the venom" but again there is no presented evidence for DISI expression in particular. Are these genes included in the 161 upregulated genes in the venom gland? The authors only present proteomic data in the form of a pie chart of overall composition grouped by toxin family (Fig 3B). Does the proteomic data generated here provide individual gene-level proteomic abundance estimates? If so, this would be valuable to include, especially in support of the authors claims about gene copy number being correlated with protein abundance. For example in Figure 3, SVMP9 and SVMP10, and to a lesser extent SVMP13, are highly expressed and therefore possibly/likely the major contributors to SVMPs in the proteome. Is the SVMP section of the pie chart in Fig 3B dominated by proteins from these 3 genes? "We studied venom evolution within the most abundant toxin groups (i.e., SVMPs and SVSPs, as well as PLA2)." PLA2s are a relatively low proportion of the venom proteome in Fig 3B, and are not present in the expression heatmap in Fig 3A. Why were these chosen for further investigation over CTL, CRISP, DISI, etc.? "The amplification of SVMP copy numbers is consistent with proteomic results, as SVMPs were the second most abundant component…". Related to my comment above, are all/many of these copies expressed in proteomic, or at least transcriptomic, data? As the data is currently presented, it appears that a small number of SVMPs are highly expressed and thus likely contributing to the proteome. This does not support, and might in fact contradict, the authors claim that there is an association with increased copy number and contribution to the proteome. 
Related to this, and more generally, the authors do not present a convincing argument for the relationship between gene copy number and the resulting percentage of a given toxin gene family in the proteome. If copy number is directly related to the resulting amount of a toxin in the proteome, the authors would need to show that many/all of those copies are expressed in the transcriptomic data, and that proteins produced from those genes are present and contributing to the venom proteome (beyond just the total percentage for the family). Further, making any links between copy number and percent overall composition in the proteome is problematic, because it inherently is impacted by copy number variation and expression of all the other toxin genes. You could, in theory, have copy number expansion in a species where all the genes are expressed and contribute to the proteome, but no overall change in the percent of that toxin family in the proteome if other toxin families have also expanded and/or are expressed more highly. Related to this, there is currently no obvious baseline to compare against in order to make these claims that expansion has resulted in higher venom proteome composition (i.e., a situation where we have fewer SVMP gene copies and a corresponding lower percentage of SVMP proteins in the venom proteome). This would potentially require comparison across species and/or populations with differing copy number, etc. My concerns above also apply to the interpretation of SVSP results: "The high number of SVSP genes found (although lower than in Crotalus adamanteus) were in line with the proteomic results, as SVSPs are the most abundant toxin in the proteome (Fig. 3B)." Further, C. adamanteus has a larger number of SVSP genes than C. gasperettii, yet a lower percent composition of SVSPs in the proteome (Margres et al. 2014), emphasizing my concerns about associating copy number and percent composition. 
Could the two large Group 2 SVSPs in Fig 4E be misannotations of multiple genes? Looking at the adamanteus genes above these, there are genes starting and ending at roughly the same positions as the start and end of these large SVSPs, making me wonder if there are multiple Cerastes genes that were annotated as one. In my own experience, I have seen similar situations where FGENESH+ was fed a large region containing multiple genes and annotated multiple genes together as one, so it might be worth double checking that that hasn't happened here. Alternatively, could these be gene fusions? If that's the case, that would presumably complicate the gene tree analyses, correct? That is, these genes would probably need to be excluded from those analyses.
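  The reviewer's concern about linking copy number to percent composition of the proteome can be made concrete with a toy calculation. The sketch below uses entirely hypothetical abundance values (not data from the manuscript or from Margres et al. 2014) and a hypothetical helper, `percent_composition`, to show that a family's share depends on every other family's abundance as well as its own.

```python
from typing import Dict


def percent_composition(abundance: Dict[str, float]) -> Dict[str, float]:
    """Convert absolute abundances per toxin family into percentages."""
    total = sum(abundance.values())
    return {family: 100.0 * value / total for family, value in abundance.items()}


# Hypothetical absolute abundances (arbitrary units), for illustration only.
baseline = {"SVMP": 30.0, "SVSP": 30.0, "PLA2": 40.0}

# Scenario 1: every family doubles its output (e.g. through expansion or
# higher expression). Absolute SVMP output doubles, yet its share is unchanged.
all_doubled = {family: 2 * value for family, value in baseline.items()}

# Scenario 2: SVMP doubles but the other families triple. Absolute SVMP
# output still doubles, yet its share of the proteome falls.
others_tripled = {"SVMP": 60.0, "SVSP": 90.0, "PLA2": 120.0}

if __name__ == "__main__":
    for label, data in (("baseline", baseline),
                        ("all doubled", all_doubled),
                        ("others tripled", others_tripled)):
        shares = percent_composition(data)
        print(label, {family: round(share, 1) for family, share in shares.items()})
```

  Because percentages are relative, a comparison across species or populations with differing copy numbers, ideally with gene-level abundance estimates, would be needed before attributing a family's share to its copy number.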
3. GigaScience 08 Jul 2025
  
  in GigaScience
  
  Reviewer Jiatang Li
  
  In the manuscript entitled 'Chromosome-level reference genome for the medically important Arabian horned viper (Cerastes gasperettii)', the authors assembled a high-quality chromosome-level reference genome for the Arabian horned viper (Cerastes gasperettii), a special Viperid species, which is an important data resource. Combined with multi-omics data, the authors characterized the genome, analysed the toxin gene families, and identified a novel SVMP gene. The research is of great significance for revealing the origin and diversification of snake venom. Overall, I think the science and findings of the study are meaningful and merit publication, but in its current form there are some issues that should be addressed:
  
  1. It should be noted that Fig. 1 and Fig. 2 both have unidentified border lines.
  
  2. In all phylogenetic trees presented in the manuscript, it would be better for the authors to indicate all species information.
  
  3. I'm curious whether the authors considered differences in sampling time, for example differences between venom glands after venom harvest and those in the resting state, which could affect the analysis, especially the transcriptome.
  
  4. In the transcriptomics section, the authors stated that the batch effect of CG1 was due to the low mapping of that sample to the reference genome. This seems to me a misinterpretation, as CG1 itself is the genome sequencing sample. The authors should explain this further.
  
  5. The authors need to ensure that all data generated for the manuscript is accessible; information about the data is not currently available.
  
  6. Please check the references to ensure that the formatting meets the publisher's requirements, e.g., some Latin names of species require italics.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.07.29.605543v1
www.medrxiv.org www.medrxiv.org

Health Data Nexus: An Open Data Platform for AI Research and Education in Medicine

1
1. GigaScience 08 Jul 2025
  
  in GigaScience
  
  We outline the development of the Health Data Nexus, a data platform which enables data storage and access management with a cloud-based computational environment. We describe the importance of this secure platform in an evolving public sector research landscape that utilizes significant quantities of data, particularly clinical data acquired from health systems, as well as the importance of providing meaningful benefits for three targeted user groups: data providers, researchers, and educators. We then describe the implementation of governance practices, technical standards, and data security and privacy protections needed to build this platform, as well as example use-cases highlighting the strengths of the platform in facilitating dataset acquisition, novel research, and hosting educational courses, workshops, and datathons. Finally, we discuss the key principles that informed the platform’s development, highlighting the importance of flexible uses, collaborative development, and open-source science.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf050 ), which carries out open, named peer-review. The following review is published under a CC-BY 4.0 license:
  
  Reviewer: Hollis Lai
  
  The purpose of the paper is to demonstrate the adoption of PhysioNet as a medical data sharing platform. The authors outlined the process, workflow, and approval chain required to facilitate such a process. The manuscript also provided initial data use and adoption figures to demonstrate the feasibility of such a platform. This is a difficult subject to publish, as the authors do demonstrate the use of the platform, but it is difficult to present this subject on a scientific basis.
  
  1. The authors describe the data lake required for sharing medical data and do a good job of describing the administrative processes required for such a data lake. However, how does this differ from the literature on other platforms? Why was this platform adopted and not other approaches? What information is provided in this adoption that other approaches did not consider or would need to know? I think there is an established literature on health data sharing platforms that the authors should acknowledge, and they should highlight how this approach is needed to address these issues.
  
  2. The authors highlight adoption data, but no evaluation data was solicited or provided. Such information would be helpful if we were to evaluate how this creation could be replicated. I think there are many great use cases for this outcome, but very little is discussed on how it could be applied in the field. For example, is this a methods paper promoting adoption of the platform by others? Or is this a paper demonstrating how others can develop similar platforms?
  
  3. There was actually no relation to AI other than the use of the data holdings for AI training. The data holdings would make sense for UToronto, as the process and approvals are built on local institutional requirements. I have tried to access the system as an external user and found it intuitive. But, other than building this platform for the purpose of UToronto holding data for UToronto researchers, are there any plans or processes for adopting holdings from other institutions? How should other users perceive this information? Could other holdings such as administrative data be used?
  
  I think the presentation of the article has merit, but more needs to be done to capture what has already been done in the field and why this solution also needs to be presented (contribution to the field).

Annotators

GigaScience

URL

medrxiv.org/content/10.1101/2024.08.23.24312060v2
www.biorxiv.org www.biorxiv.org

Analysis-ready VCF at Biobank scale using Zarr

2
1. GigaScience 08 Jul 2025
  
  in GigaScience
  
  Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results: Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs. Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost.
The VCF Zarr specification, building on widely-used, open-source technologies has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.
  
  This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf049), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer: Zexuan Zhu
  
  The paper presents an encoding of VCF data using Zarr to enable fast retrieval of subsets of the data. A vcf2zarr conversion utility was provided and validated on both simulated and real-world data sets. The topic of this work is interesting and of good value; however, the experimental studies and contributions should be considerably improved.
  1. The proposed method is simply a conversion from VCF to Zarr format. Since both are existing formats, the contributions and originality of this work are not impressive.
  2. The compression and query performance is the main concern of this work. The method should be compared with other state-of-the-art queryable VCF compressors like GTC, GBC, and GSC.
     Danek A, Deorowicz S. GTC: how to maintain huge genotype collections in a compressed form. Bioinformatics, 2018;34(11):1834-1840.
     Zhang L, Yuan Y, Peng W, Tang B, Li MJ, Gui H, et al. GBC: a parallel toolkit based on highly addressable byte-encoding blocks for extremely large-scale genotypes of species. Genome Biology, 2023;24(1):1-22.
     Luo X, Chen Y, Liu L, Ding L, Li Y, Li S, Zhang Y, Zhu Z. GSC: efficient lossless compression of VCF files with fast query. GigaScience, 2024;13:giae046.
  3. The method should be evaluated on more real VCF data sets.
2. GigaScience 08 Jul 2025
  
  in GigaScience
  
  Reviewer: Nezar Abdennur
  
  The authors present VCF Zarr, a specification that translates the variant call format (VCF) data model into an array-based representation for the Zarr storage format. They also present the vcf2zarr utility to convert large VCFs to Zarr. They provide data compression and analysis benchmarks comparing VCF Zarr to existing variant storage technologies using simulated genotype data. They also present a case study on real world Genomics England aggV2 data.

  The authors' benchmarks overall show that VCF Zarr has superior compression and computational analysis performance at scale relative to data stored as row-oriented VCF and that VCF Zarr is competitive with specialized storage solutions that require similarly specialized tools and access libraries for querying. An attractive feature is that VCF Zarr allows for variant annotation workflows that do not require full dataset copy and conversion. Another key point is that Zarr is a high-level spec and data model for the chunked storage of n-d arrays, rather than a byte-level encoding designed specifically around the genomic variant data type. I personally have used Zarr productively for several applications unrelated to statistical genetics. While Zarr VCF mildly underperforms some of the specialized formats (Savvy in compute, Genozip in compression) in a few instances, I believe the accessibility, interoperability, and reusability gains of Zarr make the small tradeoff well worthwhile.

  Because Zarr has seen heavy adoption in other scientific communities like the geospatial and Earth sciences, and is well integrated in the scientific Python stack, I think it holds potential for greater reusability across the ecosystem. As such, I think the VCF Zarr spec is a highly valuable if not overdue contribution to an entrenched field that has recently been confronted by a scalability wall.

  Overall, the paper is clear, comprehensive, and well written.

  Some high-level comments:
  * The benefits for large scientific datasets to be analysis-ready cloud-optimized (ARCO) have been well articulated by Abernathey et al., 2021. However, I do think that the "local"/HPC single-file use case is still important and won't disappear any time soon, and for some file system use cases, expansive and deep hierarchies can be performance limiting (this was hinted at in one of the benchmarks). In this scenario would a large Zarr VCF perform reasonably well (or even better on some file systems) via a single local zip store?
  * The description of the intermediate columnar format (ICF) used by vcf2zarr is missing some detail. At first I got the impression it might be based on something like Parquet, but running the provided code showed that it consists of a similar file-based chunk layout to Zarr. This should be clarified in the manuscript.
  * The authors discuss the possibility of storing an index mapping genomic coordinates to chunk indexes. Have Zarr-based formats in other fields like geospatial introduced their own indexing approaches to take inspiration from?
  * Since VCF Zarr is still a draft proposal, it could be useful to indicate where community discussions are happening and how potential new contributors can get involved, if possible. This doesn't need to be in the paper per se, but perhaps documented in the spec repo.

  Minor comments:
  * In the background: "For the representation to be FAIR, it must also be accessible," -- A is for "accessible", so "also" doesn't make sense.
  * "There is currently no efficient, FAIR representation...". Just a nit and feel free to ignore, but the solution you present is technically "current".
  * In Figure 2, the zarr line is occluded by the sav line and hard to see.
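
  The reviewer's question about an index mapping genomic coordinates to chunk indexes can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the VCF Zarr specification's or the authors' method: it assumes a single contig, variant positions sorted in ascending order, and a fixed number of variants per chunk. The function names `build_chunk_index` and `chunks_overlapping` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_chunk_index(positions, chunk_size):
    """Return (first_pos, last_pos) for each fixed-size chunk.

    Assumes `positions` is sorted ascending and covers one contig only.
    """
    return [
        (positions[i], positions[min(i + chunk_size, len(positions)) - 1])
        for i in range(0, len(positions), chunk_size)
    ]


def chunks_overlapping(index, start, end):
    """Return the chunk indexes that may hold variants in [start, end]."""
    firsts = [first for first, _ in index]
    lasts = [last for _, last in index]
    lo = bisect_left(lasts, start)   # first chunk whose last position >= start
    hi = bisect_right(firsts, end)   # one past the last chunk whose first position <= end
    return list(range(lo, hi))


positions = [100, 200, 300, 400, 500, 600, 700]
index = build_chunk_index(positions, chunk_size=3)
region_chunks = chunks_overlapping(index, 250, 450)
```

  With such an index, a region query only needs to read the chunks returned by the lookup rather than scanning the whole position array; how this would be stored alongside the arrays is left open here.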

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.06.11.598241v3
Jun 2025
www.biorxiv.org www.biorxiv.org

The first near-complete genome assembly of pig: enabling more accurate genetic research

3
1. GigaScience 03 Jun 2025
  
  in GigaScience
  
  Pigs are crucial sources of meat and protein, valuable animal models, and potential donors for xenotransplantation. However, the existing reference genome for pigs is incomplete, with thousands of segments and missing centromeres and telomeres, which limits our understanding of the important traits in these genomic regions. To address this issue, we present a near complete genome assembly for the Jinhua pig (JH-T2T), constructed using PacBio HiFi and ONT long reads. This assembly includes all 18 autosomes and the X and Y sex chromosomes, with only six gaps. It features annotations of 46.90% repetitive sequences, 35 telomeres, 17 centromeres, and 23,924 high-confident genes. Compared to the Sscrofa11.1, JH-T2T closes nearly all gaps, extends sequences by 177 Mb, predicts more intact telomeres and centromeres, and gains 799 more genes and loses 114 genes. Moreover, it enhances the mapping rate for both Western and Chinese local pigs, outperforming Sscrofa11.1 as a reference genome. Additionally, this comprehensive genome assembly will facilitate large-scale variant detection and enable the exploration of genes associated with pig domestication, such as GPAM, CYP2C18, LY9, ITLN2, and CHIA. Our findings represent a significant advancement in pig genomics, providing a robust resource that enhances genetic research, breeding programs, and biomedical applications.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaf048), where the paper and peer reviews are published openly under a CC-BY 4.0 license.
  
  Revision 2 version
  
  Reviewer 2: Benjamin D Rosen
  
  The first near-complete genome assembly of pig: enabling more accurate genetic research.
  
  General comments:
  
  The authors have clarified how their HiC manual curation efforts were able to remove gaps from the assembly. This was my only remaining major issue. I only have a few minor comments remaining.
  
  Minor comments:
  
  Line 1 - Title: "A Near Telomere-to-Telomere Genome Assembly of the Jinhua Pig"
  
  Line 369 - replace "only 6 gaps left in our final JH assembly" with "only 6 gaps remain in our final JH assembly"
  
  Line 370 - Figure S5 needs a more detailed legend
  
  Line 405 - I just noticed this, but are the authors proposing that chr9 has 2 centromeres? Given the known pig karyotype (metacentric chr9), it seems more likely that they have identified some other form of tandem repeat at the beginning of chr9.
2. GigaScience 03 Jun 2025
  
  in GigaScience
  
  Revision 1 version
  
  Reviewer 1: Martien Groenen
  
  In their revised version of the manuscript, the authors have addressed all my major concerns raised in my earlier review and have made the many editorial edits as suggested. I only have a few (mostly editorial) comments for the revised version. The most important one is the title of the manuscript. I realize I did not mention this in my earlier review, but I think the title is not very appropriate and could be more informative. I suggest something like "A telomere-to-telomere genome assembly of the Jinhua pig"
  
  Minor editorial comments:
  Line 40: Replace "provides" by "provide"; "genome" to "genomes" and "JH" to "Jinhua"
  Lines 50-51: "This study produced a gapless and near-gapless assembly of the pig genome, and provides a set of diploid JH reference genome." should be changed to something like "This study produced a near-gapless assembly of the pig genome and provides a set of haploid Jinhua reference genomes."
  Line 177: Change "with with" to "with"
  Line 194: Replace "population" by "populations"
  Lines 232-233: Referring to human as a "closely related species" is rather awkward and not correct. I suggest replacing this with "eleven other mammals"
  Lines 299, 301 and 303: Insert "of" after "consisting"
  Line 317: Insert "and" before "2.33 Gb"
  Line 319: Insert "and" before "2.17 Gb"
  Lines 320-321: Change to "The more continuous contigs of the two assemblies were selected to construct the final haploid assemblies".
  Line 323: Replace "assembly" by "assembler"
  Line 354: Delete "ranging"
  Lines 358-359: Change "The average properly mapped rate" to "The average rate of properly mapped reads"
  Line 379: Insert "respectively" after "60.07"
  Line 380: "suggested" (remove space)
  Line 385: Change "indicate a gapless and near-gapless" to "indicate a near-gapless"
  Line 455: Change "were overlapped with" to "were overlapping with"
  Lines 557-559: The sentence "The insertion found in the SLA-DOB gene, which serves to enhance the immune system's response and is relevant to transplant rejection" seems incomplete and sounds awkward. Perhaps you mean something like "The insertion found in SLA-DOB, a gene involved in enhancing the immune system's response to infection, might be relevant in relation to transplant rejection"
  
  Reviewer 2: Benjamin D Rosen
  
  The first near-complete genome assembly of pig: enabling more accurate genetic research
  
  General comments: I thank the authors for addressing most of my points and providing more details on the parameters they have used. Unfortunately, I still have some unanswered questions regarding the methodology. My current understanding from the authors' responses to my previous comments leads me to believe that the assembly has been scaffolded incorrectly. If the authors did indeed use HiC data to place 8 contigs into gaps and then joined those contigs without placing gaps at the joins or doing any further gap filling, that calls into question the validity of the assembly. Finally, the language needs further improvement for readability.
  
  Specific comments:
  Line 85 - *will contribute to.
  Lines 187-191 - HiC interaction maps do not provide information for gap filling. Either this has been explained insufficiently, or it has been done incorrectly. Placing assembled sequences in the correct order does not mean that it is okay to join them without a gap. It is necessary to return to the gap filling procedure now that the contigs are in the correct order and attempt to fill them as done previously.
  Line 191 - Figure S3 - These HiC contact maps are not very informative; they need to be labeled and have a scale bar. Additionally, contact maps can have a lack of signal due to a gap in the sequence or due to multimapping reads in repetitive regions being filtered, so it's not clear what they are trying to show in A-C. The authors' reply to my previous concern regarding the labeling of this figure does not help; furthermore, the figure legend in the supplemental materials is still insufficient. I think I understand that panels D and E are chr3 before and after misassembly correction; it would be helpful if the two panels were at the same scale. I still don't know why panel F is shown or how it is related to panel C, and I don't see any red ellipses indicated by the legend.
  Line 275 - "ensemble from Duroc pigs" is incorrect. It is an "assembly of a Duroc pig".
  Lines 299, 301, 303 - "containing" not "consisting"
  Lines 306-308 - Again, HiC data orders and orients contigs, but it does not fill gaps. Please clarify how the assembly was reduced from 14 gaps to 6 gaps with HiC data. Was an additional round of gap filling performed?
  Lines 313-314 - How is the contig N50 larger than the scaffold N50 above?
  Lines 335-336 - Does this refer to the Merqury analysis? I don't think "using mapped K-mers" is correct here, please reword.
  Lines 367-368 - what does it mean that "8 out of 63 gaps were corrected"? Is this from the HiC ordering of contigs?
  Line 369 - what does the mapping between Sscrofa11.1 and JH-T2T shown in figure S6 have to do with the JH-T2T gap filling being described here?
  Line 369 - I previously asked about this supplemental table only containing 55 entries. The authors' response "The other filled 8 gaps were resolved through adjustments made to the Hi-C map to correct misassembles. As a result, these gaps cannot be precisely located within the existing order of the assembly." indicates that contigs must have been incorrectly joined solely based on the HiC signal between contigs. The authors must know what contigs were added or joined to form the final assembly. It would be trivial to align the two assembly versions and identify the positions of the old contigs in the new assembly. I believe that these incorrectly joined contigs should be broken and put through the same gap filling procedure as performed earlier.
  Lines 375-378 - Dramatic coverage changes in read mappings as found in these figures are usually indicative of assembly errors. I do not agree that "These findings confirmed the accuracy and reliability" of the assembly. I suggest replacing the last sentence with something more measured such as "Although supported by some read data, the inconsistency of coverage across these gap filled regions suggests that caution should be used when interpreting findings in these regions, cross-referencing results with the gap positions (Supplementary Table S9) is advised."
  Line 375 - "evidenced by fully coverage" remove "fully", it isn't proper usage of the word and I wouldn't interpret the low coverage in many of these regions as "full coverage".
  Line 385 - should read "Overall, our assembly quality metrics indicate a near-gapless assembly of the pig genome"
  Line 390 - should read "a gapless T2T sequence for 16 out of 20"
  Line 396 - Supplemental table 10 not 9.
  Lines 398-399 - according to supplemental table S4 and figure 3A, chromosome 2 also has a single telomere.
  Line 402 - the centromeres are not marked in Figure 3A.
  Line 402 - Figure S8 - please rename chr19 and chr20, chrX and chrY.
  Line 406 - "at early research" unclear what is meant by this. Please reword.
  Line 423 - as indicated on line 397, 33 telomeres were identified, not 35.
  Line 426 - "The JH-T2T assembly IDENTIFIED 17 centromeres"
  Line 450 - "are located in"
  Line 453 - "these SVs are located in"
  Line 455 - "Moreover, 12,129 genes overlap these SVs"
  Line 502 - "which contained 544 gaps"
  Line 841 - Figure 2 legend description is still incorrect. Only A is mapping rates, B and C are PM rates and base error rates.
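
  The question on lines 313-314 (contig N50 larger than scaffold N50) rests on how the two statistics relate. The following is a minimal sketch, assuming contigs are obtained by splitting scaffolds at runs of `N` and that N50 is the length of the sequence at which the cumulative sum of lengths, taken from longest to shortest, first reaches half the total. The helper names `n50` and `split_contigs` and the toy sequences are hypothetical and are not taken from the manuscript.

```python
import re


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def split_contigs(scaffold):
    """Split a scaffold sequence into contigs at runs of N (gap placeholders)."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


# Toy assembly: one scaffold with a single 10 bp gap, one gap-free scaffold.
scaffolds = ["A" * 90 + "N" * 10 + "C" * 50, "G" * 30]
contigs = [c for s in scaffolds for c in split_contigs(s)]

scaffold_n50 = n50([len(s) for s in scaffolds])
contig_n50 = n50([len(c) for c in contigs])
```

  In this toy case the scaffold N50 is 150 and the contig N50 is 90. When both values are computed from the same assembly in this way, the contig N50 would normally not be expected to exceed the scaffold N50, which is why the reported figures prompt the question.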
3. GigaScience 03 Jun 2025
  
  in GigaScience
  
  Original version
  
  Reviewer 1: Martien Groenen
  
  The manuscript describes the T2T genome assembly for the Chinese pig breed Jinhua, which presents a vast improvement compared to the current reference genome of the Duroc pig TJTabasco (build11.1). The results and methodology used for the assembly are described clearly and the authors show the improvement of this assembly by a detailed comparison with the current reference 11.1. While clearly of interest to be published, several aspects of the manuscript should be improved. Most of these changes are minor modifications or inaccuracies in the presentation of the results.
  
  However, there are two major aspects that need further attention:
  
  The T2T assembly presented represents a combination of the two haplotypes of the pig sequenced. I am surprised that the authors did not also develop two haplotype-resolved assemblies of this genome. Haplotype-resolved assemblies will be the assemblies of choice for future developments of a reference pan-genome for pigs. The authors describe that they have sequenced the two parents of the sequenced F1 individual, so why did they not use the trio-binning approach to also develop haplotype-resolved assemblies? I think adding these to the manuscript would be a vast improvement for this important resource.
  
  The results described for the identification of selective sweep regions are not very convincing. This analysis shows differences in the genomes of two breeds: Duroc and Jinhua. However, these breeds have a very different origin of domestication of wild boars that diverged 1 million years ago, followed by the development of a wide range of different breeds selected for different traits. Therefore, the comparison made by the authors cannot distinguish between differences in evolution of Chinese and European Wild Boar, more recent selection after breed formation and even drift. To be able to do so, these analyses would need the inclusion of additional breeds and wild boars from China and Europe. Alternatively, the authors can decide to tone down this part of the manuscript or even delete it altogether, as it does not add to the major message of the manuscript.

  Minor comments:
  Line 34: Change the sentence to: "with thousands of segments and centromeres and telomeres missing"
  Line 37: Insert "and Hi-C" after "long reads"
  Line 46: Delete "such as GPAM, CYP2C18, LY9, ITLN2, and CHIA"
  Line 54: Insert "potential" before "xenotransplantation"
  Line 82: Delete "in response to the gap of a T2T-level pig genome" as this does not add anything and the use of "gap" in this context is confusing.
  Line 93: Change "The fresh blood" to "Fresh blood"
  Line 100: The authors need to provide a reference for the SDS method.
  Lines 152-153, line 444, and table S6: This is confusing. The authors mention Genotypes from 939 individuals, but in the table it is shown that they have used WGS data. You need to describe how the WGS data was used to call the genotypes for these individuals. Furthermore, in line 444 you mention 289 JH pigs and 616 DU pigs which together is 905. What about the other 34 individuals shown in table S6?
  Line 244: Replace "were" by "was" and delete "the" before "fastp"
  Lines 287-292: Here you use several times "length of xx Gb and yy contigs". This is not correct as the value for the contigs refers to a number and not a length. Rephrase e.g. like "length of xx Gb and consisting of yy contigs"
  Line 294: The use of "bone" seems strange. Either use "backbone" or "core"
  Line 306: Replace "chromosome" by "genome"
  Lines 308-309: For the comment "Second, 16 of the 20 chromosomes were each represented by a single contig" you refer to figure 1D; however, from this figure it cannot be seen if the different chromosomes consist of a single or multiple contigs.
  Line 346: Do you mean build 11.1 with "historical genome version"? If so, please use that instead.
  Line 349: "post-gap filled"
  Line 353: The largest gap is 35 kb not 36 kb. Figures 2F-I should be better explained in the legends and the main text (lines 353-358).
  Line 378: For the 23,924 genes you refer to supp table S13. However, that table shows a list of SV enriched QTL not these genes. Furthermore, I checked all tables but a table with all the protein coding genes is missing.
  Line 380: For the 799 newly anchored genes, refer to table S10. Now you refer to table S17 which shows genes enriched KEGG pathways.
  Lines 383-386: For the higher gene density in GC rich regions, you refer to figure 1D, but it is impossible to see this correlation from figure 1D. For the density of genes and telomeres, you refer to figure 1G. However, that figure does not show gene densities, only repeat densities.
  Lines 406-407: This should be table S11.
  Lines 409-412: For this result you refer to table S11. However, that table only shows data for the gained genes, not the lost genes.
  Lines 419-420: You refer to table S12 and figure 3B, but the information is only shown in figure 3B and not in table S12.
  Line 420: Replace "were" by "is"
  Line 422: Better to use "repeats" instead of "they"
  Line 425: "Moreover, 12,129 genes located in these SVs". Unclear to what "these" refers and I assume that you mean genes that (partially) overlap with SVs? Also, this is an incomplete sentence (verb missing). Likewise, this number is not very meaningful as many of these SVs are within introns. It is much more informative to mention for how many genes SVs affect the CDS.
  Line 433 and table S14: This validation is not clear at all. What exactly are these numbers that are shown? You also mention "greater than 1.00" but the table does not contain any number that is greater than 1.00.
  Line 435: "Table" not "Tables"
  Line 436: Change to "SVs with a length larger than 500 bp"
  The term "invalidate" in figure 3D is rather awkward. Better to use "not-validated" and "validated" in this figure.
  Line 449: This should be Table S16.
  Line 452: There is no Table S18
  Lines 484-486: Change to "Similarly, in human, the use of the T2T-CHM13 genome assembly yields a more comprehensive view of SVs genome-wide, with a greatly improved balance of insertions and deletions [61]."
  Lines 500-501: Change to "For example, in human, the T2T-CHM13 assembly was shown to improve the analysis of global"
  Lines 517-528: This paragraph should be deleted as these genes have already been annotated and described in previous genome builds including 11.1. Why discuss these genes here? Following that line of thinking, almost every gene of the 20,000 can be discussed.
  Line 532: "%" instead of "%%" and insert "which" after "SVs"
  Lines 537-542: These sentences should be deleted. It is common knowledge that second generation sequencing is not very sensitive to identify SVs. The authors also do not provide any results about dPCR.
  Line 544: "affect" rather than "harbor"
  Lines 544-547: This is repetitive and has been stated multiple times so better to delete.
  Line 561: "which is serve to immune system's response and relevant to transplant rejection" This is an incorrect sentence and should be rephrased.
  Lines 562-568: I don't agree with this statement and suggest removing it from the discussion.
  
  Reviewer 2: Benjamin D Rosen
  
  The first near-complete genome assembly of pig: enabling more accurate genetic research. The authors describe the telomere-to-telomere assembly of a Jinhua breed pig. They sequenced genomic DNA from whole blood with PacBio HiFi and Oxford Nanopore (ONT) long-read technologies as well as Illumina for short reads. They generated HiC data for scaffolding from blood and extracted RNA from 19 tissues for short read RNAseq for gene annotation. A hifiasm assembly was generated with the HiFi data and scaffolded with HiC to chromosome level with 63 gaps. The scaffolded assembly was gap filled with contigs from a NextDenovo assembly of the ONT data bringing the gaps down to 14. Finally, the assembly was manually curated with juicebox somehow closing a further 8 gaps. This needs to be clarified. Standard assembly assessments were performed as well as genome annotation. The authors compared their assembly to the current reference, Sscrofa11.1, and called SVs between the assemblies. The SVs were validated with additional Jinhua and Duroc animals. They then identified signatures of selection present in some of the largest SVs.
  
  General comments: The manuscript is mostly easy to read but would benefit from further editing for language throughout. The described assembly appears to be high quality and quite contiguous. Although the authors do mention obtaining parental samples and claim the assembly is fully phased, there is no mention of how this was done. There are many additional places where the methods could be described more fully including the addition of parameters used.
  
  Specific comments: Line 39 - Figure 1 only displays 34 telomeres, not 35. Additionally, I was only able to detect 33 telomeres using seqtk telo. Seqtk only reports telomeres at the beginning and end of sequences, digging further, the telomere on chr2 is ~59kb from the end of the chromosome, perhaps indicating a misassembly. Lines 79-81 - there are not hundreds of species with gap free genome assemblies and reference 19 does not claim that there are. Line 82 - the assembly is not gap-free, replace with "nearly gap-free" Line 95 - were these parental tissue samples ever used? Lines 151-156 - this section would be better located below the assembly methods. Please number supplementary tables in order of their appearance in the text. Line 171 - please provide parameters used here and for all analyses. Lines 187-188 - how did rearranging contigs decrease the gaps? Was the same gap filling procedure used after HiC manual adjustments? Line 188 - Figure S3 - I don't understand the relationship between the panels nor what the authors are attempting to show. If panels A-C display chromosomes 2, 8, and 13, Why does D display chr3? Both panels C and E are labeled chr13 but they look nothing alike. Are D-E whole chromosomes or zoomed in views? Missing description of panel F. Lines 222-224 - why weren't pig proteins used? Ensembl rapid release has annotated protein datasets for 9 pig assemblies. Line 264 - although most will know this, make it clear that Sscrofa11.1 is an assembly of a Duroc pig. Line 292 - how was polishing performed? This is missing from the methods. Line 294 - should this read "selected it for the backbone of the genome assembly."? Lines 298-299 - methods? Line 314 - what is meant by "using mapped K-mers from trio Illumina PCR-free reads data"? Line 331 - accession numbers for assemblies would be useful. Line 333 - what is "properly mapped rate"? Do you mean properly paired mapping rate? Line 346 - what is the historical genome version? 
Line 349 - Supplemental Table S8 only has 55 entries including the 6 remaining gaps. Where are the other 8 filled gaps located? Lines 350-358 - read depth displays wouldn't show the presence of clipped reads, which would indicate an improperly closed gap. It would be more convincing to display IGV windows containing these alignments showing that there are no clipped reads. Line 354 - Figure S5 needs a better legend. What is ref and what is own? Line 359 - the assembly is near-gapless. Line 359 - where is the data regarding assembly phasing? How was this determined to be fully phased? Line 363 - 16 of 20 chromosomes are gapless. Line 370 - only 33 telomeres were found at the expected location (end of the chromosome); if you count the telomere on chr2 59 kb from the end, then 34 telomeres were identified. Line 372 - chr13 also only has a single telomere. It does not have a telomere at the beginning. Line 372 - chr19 is chrX, correct? Line 374 - Figure 1G - It would be nice to have the centromeres marked on this plot (or in Figure 3A). Are the long blocks of telomeric repeats internal to the chromosomes expected? Line 423 - Figure 3A - there is no telomeric repeat at the beginning of chr4 or chrX. Line 431 - why were only 5 pigs of each breed used to validate SVs when hundreds of WGS datasets from the two breeds had been aligned? How were these 5 selected? Line 481 - Sscrofa11.1 only has 544 gaps. Line 492 - ONT data was used to fill more than 6 gaps. Gaps in the assembly were reduced from 63 to 14 using ONT contigs. Lines 588-589 - please make your code publicly available through Zenodo, GitHub, figshare, or something similar. Lines 815-824 - Figure 2 - legend description needs to be improved. Only A is mapping rates; B and C are PM rates and base error rates. The color switch from A-C having European pigs in blue to D having JH-T2T in blue might confuse readers.
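The distinction drawn in the comments on Lines 39 and 370, between telomeres at a chromosome end and telomeric repeat blocks further inside, can be illustrated with a small scan. This is only a rough sketch: it is not how seqtk telo works, it matches perfect copies of the vertebrate repeat TTAGGG (and its reverse complement) only, and the thresholds `min_copies` and `end_window` are arbitrary assumptions chosen for illustration.

```python
import re

# Vertebrate telomeric repeat unit and its reverse complement.
TELOMERE_UNITS = ("TTAGGG", "CCCTAA")


def find_telomere_blocks(seq, min_copies=50):
    """Return sorted (start, end, unit) for runs of >= min_copies perfect tandem repeats."""
    blocks = []
    upper = seq.upper()
    for unit in TELOMERE_UNITS:
        pattern = re.compile("(?:%s){%d,}" % (unit, min_copies))
        for match in pattern.finditer(upper):
            blocks.append((match.start(), match.end(), unit))
    return sorted(blocks)


def classify_blocks(seq, blocks, end_window=1000):
    """Label each block 'start', 'end' or 'internal' by its distance to the sequence ends."""
    labelled = []
    for start, end, unit in blocks:
        if start <= end_window:
            where = "start"
        elif len(seq) - end <= end_window:
            where = "end"
        else:
            where = "internal"  # hypothetical cutoff; a block 59 kb from the end would land here
        labelled.append((where, start, end, unit))
    return labelled
```

Run on each chromosome sequence, a block labelled `internal` close to one end would flag the kind of case described for chr2, which a tool that inspects only the sequence termini would not report.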

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.10.13.617951v1
www.biorxiv.org www.biorxiv.org

External validation of machine learning models - registered models and adaptive sample splitting

2
1. GigaScience 03 Jun 2025
  
  in GigaScience
  
  Multivariate predictive models play a crucial role in enhancing our understanding of complex biological systems and in developing innovative, replicable tools for translational medical research. However, the complexity of machine learning methods and extensive data pre-processing and feature engineering pipelines can lead to overfitting and poor generalizability. An unbiased evaluation of predictive models necessitates external validation, which involves testing the finalized model on independent data. Despite its importance, external validation is often neglected in practice due to the associated costs. Here we propose that, for maximal credibility, model discovery and external validation should be separated by the public disclosure (e.g. pre-registration) of feature processing steps and model weights. Furthermore, we introduce a novel approach to optimize the trade-off between efforts spent on training and external validation in such studies. We show on data involving more than 3000 participants from four different datasets that, for any “sample size budget”, the proposed adaptive splitting approach can successfully identify the optimal time to stop model discovery so that predictive performance is maximized without risking a low powered, and thus inconclusive, external validation. The proposed design and splitting approach (implemented in the Python package “AdaptiveSplit”) may contribute to addressing issues of replicability, effect size inflation and generalizability in predictive modeling studies.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaf036), where the paper and peer reviews are published openly under a CC-BY 4.0 license.
  
  Revision 1 version
  
  Reviewer 1: Qingyu Zhao
  
  Thanks to the authors for the thorough response. The only remaining comment is that some new supplementary figures (figures 8-12) are not cited or explained in the main text (maybe I missed it?). Please make sure to discuss these supplementary figures in the main text; otherwise readers won't know they are there. The response reads "To provide even more insights, we now present the relationship between the internally validated scores at the time of stopping (I_{act}), the corresponding external validation scores and sample sizes, for all 4 datasets in supplementary figures 8-11. The figures show a relatively good correspondence between internally and externally validated performance estimates with all splitting strategies". What insights are given? What do you mean by relatively good correspondence between internal and external performance? All I see in those figures are some normally distributed scatter plots, so it needs better explanation.
  
  Reviewer 2: Lisa Crossman
  
  I previously reviewed this MS and all the comments I made were answered in full. I would be pleased to recommend publication. I was fully able to replicate the adaptive split results from the GitHub repo. I have only one comment which is that I received several generated warnings of "RuntimeWarning: divide by zero encountered in scalar divide", and these can also be seen in the Jupyter notebook example.
2. GigaScience 03 Jun 2025
  
  in GigaScience
  
  Original version
  
  Reviewer 1: Qingyu Zhao
  
  The manuscript discusses an interesting approach that seeks optimal data split for the pre-registration framework. The approach adaptively optimizes the balance between predictive performance of discovery set and sample size of external validation set. The approach is showcased on 4 applications, demonstrating advantage over traditional fixed data split (e.g., 80/20). I generally enjoyed reading the manuscript. I believe pre-registration is one important tool for reproducible ML analysis and the ideology behind the proposed framework (investigating the balance between discovery power and validation power) is urgently needed. My main concerns are all around Fig. 3, which represents the core quantitative analysis but lacks many details.
  
  Fig. 3 is mostly about external validation. What about training? For each n_total, which stopping rule is activated? What is the training accuracy? What does l_act look like? What is \hat{s_total}?
  
  Results section states "the proposed adaptive splitting strategy always provided equally good or better predictive performance than the fixed splitting strategies (as shown by the 95% confidence intervals on Figure 3)". I'm confused by this because the blue curve is often below other methods in accuracy (e.g., comparing with 90/10 split in ABIDE and HCP).
  
  Why does the half split have the lowest accuracy but the highest statistical power?
  
  How was the range of x-axis (n_total) selected? E.g., HCP has 1000 subjects, why was 240-380 chosen for analysis?
  
  The lowest n_total for BCW and IXI is approximately 50. If n_act starts from 10% of n_total, how is it possible to train (nested) cross-validation on 5 samples or so?
  
  Two other general comments are: 1. How can this be applied to retrospective data or secondary data analysis where the collection is finished? 2. Is there a guidance on the minimum sample size that is required to perform such an auto-split analysis? It is surprising that the authors think the two studies with n=35 and n=38 are good examples of training generalizable ML models. It is generally hard to believe any ML analysis can be done on such low sample sizes with thousands of rs-fMRI features. By the way, I believe n=25 in Kincses 2024 if I read it correctly.
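  The question above about the half split having the lowest accuracy but the highest statistical power can be illustrated with a back-of-the-envelope calculation. The sketch below assumes that external performance is summarised as a correlation between predicted and observed values and tested with the Fisher z approximation; the effect sizes per split are made-up values for illustration, not numbers from the manuscript, and `correlation_power` is a hypothetical helper rather than part of the AdaptiveSplit package.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(r, n, alpha=0.05):
    """Approximate power of a two-sided test of H0: rho = 0 using the Fisher z-transform."""
    if n <= 3:
        return 0.0  # the approximation needs n > 3
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = atanh(r) * sqrt(n - 3)
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


n_total = 200
# Hypothetical (validation share in %, true correlation of the final model) per split.
scenarios = {"50:50": (50, 0.25), "80:20": (20, 0.30), "90:10": (10, 0.32)}
for name, (val_percent, r_true) in scenarios.items():
    n_val = n_total * val_percent // 100
    print(name, n_val, round(correlation_power(r_true, n_val), 2))
```

  Under these assumptions the larger validation set of the half split more than compensates for its weaker model, which is one way the pattern in Fig. 3 could arise.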
  
  Reviewer 2: Lisa Crossman
  
  External validation of machine learning models - registered models and adaptive sample splitting, Gallitto et al. The manuscript describes a methodology and algorithm aimed at better choosing a train-test validation split of data for scikit-learn models. A Python package, adaptivesplit, was built as part of this MS as a tool for others to use. The package is proposed to be used together with a suggested workflow to integrate an approach invoking registered models as a full design for better prospective modelling studies. Finally, the work is evaluated on four alternative publicly available datasets of health research data and comprehensive results are presented. There is a trade-off in the split between the amount of sample data to be used for training and the amount of data to use for validation. Ideally, the content of each must be balanced so that both the trained model and the validation set are representative. This manuscript is therefore very timely due to the large increase in the use of AI models and provides important information and methodology.
  
  This reviewer does not have the specific expertise to provide detailed comments on the statistical rule methods.
  
  Main Suggested Revision: 1. The Python implementation of the "adaptivesplit" package is described as available on GitHub (Gallitto et al., n.d.). One of the major points of the paper is to provide the Python package "adaptivesplit"; however, this package does not have a clear hyperlink, is not found by simple Google searches, and appears not to be available yet. It is therefore not possible to evaluate it at present. After further Google searches, a website with a preprint of this MS was found, https://pnilab.github.io/adaptivesplit/; however, adaptivesplit is shown there as an interactive Jupyter-type notebook example and not as Python library code. Therefore, it is not clear how available the package is for others' use. Can the authors comment on the code availability?
  
  Minor comments: 1. Apart from the 80:20 Pareto split of train-test data, other splits are commonly used, in ratios such as 75:25 (the scikit-learn default split if the ratio is unspecified) and 70:30, as is a train-test-validation split of 60:20:20 for cross-validation. Yet these strategies have not been mentioned or included in figures such as Fig. 3. The splits provided in the figure and discussed are 50:50, 80:20 and 90:10 only. Could the authors discuss alternative split ratios?
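  To make the split ratios named in this comment concrete, the following sketch converts each ratio into sample counts for a small and a moderate sample size. `split_sizes` is a hypothetical helper written for illustration; its rounding (remainder assigned to the first part) is an assumption and may differ by a sample from what scikit-learn or the adaptivesplit package would produce.

```python
def split_sizes(n_total, ratio):
    """Split n_total samples according to integer ratio parts, e.g. (75, 25) or (60, 20, 20)."""
    total_parts = sum(ratio)
    sizes = [n_total * part // total_parts for part in ratio]
    # Samples lost to integer division go to the first (training) part.
    sizes[0] += n_total - sum(sizes)
    return sizes


ratios = [(50, 50), (60, 20, 20), (70, 30), (75, 25), (80, 20), (90, 10)]
for n_total in (50, 300):
    for ratio in ratios:
        print(n_total, ":".join(map(str, ratio)), split_sizes(n_total, ratio))
```

  At n_total = 50 the smaller side of a 90:10 split holds only 5 samples, which is the situation Reviewer 1 queries for the lowest n_total values in BCW and IXI.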

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.12.01.569626v2
www.biorxiv.org www.biorxiv.org

Spatial Integration of Multi-Omics Data using the novel Multi-Omics Imaging Integration Toolset

2
1. GigaScience 03 Jun 2025
  
  in GigaScience
  
  To truly understand the cancer biology of heterogenous tumors in the context of precision medicine, it is crucial to use analytical methodology capable of capturing the complexities of multiple omics levels, as well as the spatial heterogeneity of cancer tissue. Different molecular imaging techniques, such as mass spectrometry imaging (MSI) and spatial transcriptomics (ST) achieve this goal by spatially detecting metabolites and mRNA, respectively. To take full analytical advantage of such multi-omics data, the individual measurements need to be integrated into one dataset. We present MIIT (Multi-Omics Imaging Integration Toolset), a Python framework for integrating spatially resolved multi-omics data. MIIT’s integration workflow consists of performing a grid projection of spatial omics data, registration of stained serial sections, and mapping of MSI-pixels to the spot resolution of Visium 10x ST data. For the registration of serial sections, we designed GreedyFHist, a registration algorithm based on the Greedy registration tool. We validated GreedyFHist on a dataset of 245 pairs of serial sections and reported an improved registration performance compared to a similar registration algorithm. As a proof of concept, we used MIIT to integrate ST and MSI data on cancer-free tissue from 7 prostate cancer patients and assessed the spot-wise correlation of a gene signature activity for citrate-spermine secretion derived from ST with citrate, spermine, and zinc levels obtained by MSI. We confirmed a significant correlation between gene signature activity and all three metabolites. To conclude, we developed a highly accurate, customizable, computational framework for integrating spatial omics technologies and for registration of serial tissue sections.
  
  A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaf035), where the paper and peer reviews are published openly under a CC-BY 4.0 license.
  
  Revision 1 version
  
  Reviewer 1: Hua Zhang
  
  The quality of this manuscript has significantly improved in this revision. I appreciate the author's effort in thoroughly addressing all concerns and comments.
  
  Reviewer 2: Santhoshi Krishnan
  
  All my concerns have been adequately addressed by the authors and I have no further questions.
2. GigaScience 03 Jun 2025
  
  in GigaScience
  
  Original Submission

  Reviewer 1: Hua Zhang
  
  Wess et al. report a Python framework, MIIT (Multi-Omics Imaging Integration Toolset), for integrating spatially resolved multi-omics data. Multi-omics imaging represents a pivotal approach for systems molecular biology and biomarker discovery. This method introduces a timely and valuable tool to advance the field. However, in my opinion, this paper still has some issues that need to be addressed before consideration for publication. Cancer tissue exhibits significant heterogeneity. In this study, the different molecular information is obtained from different tissue sections, which means from different cells, as each tissue section is 10 μm thick, almost the diameter of a cell. Please highlight the meaning of the co-registration information if it is obtained from different cell layers. In particular, for the datasets of spatial transcriptomics and MSI, the experiments were conducted on serial sections with an axial sectioning distance of 40 to 100 μm. This means that the mRNA and metabolites originate from different cells, raising questions about how integrating these two datasets can provide meaningful insights. The multi-omics imaging integration toolset is based on GreedyFHist, a non-rigid registration algorithm. I suggest including more details about this algorithm and highlighting how it differs from previously reported non-rigid image co-registration algorithms. The authors should demonstrate the accuracy of the background segmentation; the concern is that sample areas with low signal could be removed in the denoising step. What is the criterion for defining the background region and the sample region in the background segmentation?
  
  In the Methods section, more details need to be included in the spatial transcriptomics part: what spatial resolution of the 10x Genomics platform was used? As the MALDI resolution is 30 μm, how are the pixels of the ST and MSI data aligned if their spatial resolutions differ? In the MALDI-MSI of prostate tissue, on-tissue MS/MS data are missing to confirm the identification of the target analytes citrate, ZnCl3-, and spermine.
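  The resolution question above can be made concrete with a toy aggregation. The sketch below is not MIIT's mapping procedure. It assumes the two modalities are already co-registered in a common micrometre coordinate system, takes the nominal Visium spot diameter of 55 micrometres, and simply averages the 30 micrometre MSI pixels whose centres fall inside a spot. Both function names are hypothetical.

```python
from math import hypot


def pixels_in_spot(spot_center, spot_diameter, pixel_size, grid_shape):
    """Yield (row, col) of MSI pixels whose centres lie inside a circular spot (micrometres)."""
    cx, cy = spot_center
    radius = spot_diameter / 2
    rows, cols = grid_shape
    for row in range(rows):
        for col in range(cols):
            px = (col + 0.5) * pixel_size
            py = (row + 0.5) * pixel_size
            if hypot(px - cx, py - cy) <= radius:
                yield row, col


def spot_intensity(msi, spot_center, spot_diameter=55.0, pixel_size=30.0):
    """Average MSI intensities over one spot; None if no pixel centre falls inside it."""
    shape = (len(msi), len(msi[0]))
    values = [msi[r][c] for r, c in pixels_in_spot(spot_center, spot_diameter, pixel_size, shape)]
    return sum(values) / len(values) if values else None
```

  With these nominal sizes only a handful of MSI pixels contribute to each spot, so the choice between centre-based inclusion and area-weighted overlap can noticeably change the aggregated value.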
  
  Reviewer 2: Santhoshi Krishnan
  
  Overview: In this paper, the authors present the Multi-Omics Imaging Integration Toolset, which is a Python framework for integrating multiple spatial omics datatypes. To facilitate this, they also developed a registration method (GreedyFHist) for jointly analyzing sequential tissue layers that have undergone different types of staining/phenotyping regimens. The method validation was done on 244 fresh-frozen prostate tissue sections. The highly detailed methods and results section is well appreciated and helps fully contextualize the significance of the study. The definitions of study-specific terms at the beginning of the paper are also appreciated. Data and Code Availability: Detailed code, tutorials and associated instructions have been made available for use by the public, which is appreciated. All system requirements have also been explicitly laid out for ease of installation and use. The workflow examples provided are quite detailed; however, a more extensive codebase with stepwise explanations within the code would be appreciated. Data has not been made available publicly, except for the raw and processed spatial transcriptomics data; however, detailed and explicit instructions have been provided on data access, keeping in mind local regulations. Revisions: Major Revisions: 1. In recent years, a lot of other platforms, both free and paid, tend to support registration across multiple slides. For example, HALO has a registration feature available as well, along with a host of other open-source datatypes. In that regard, how is your platform different? 2. It is mentioned that downscaling occurs during the registration process in order to reduce runtime - how are nuances in features selected as registration landmarks preserved in such a case? 3. How is the fixed image determined in this case? The assumption would be that a standard H&E image is selected for this purpose - is that assumption correct? 4. The authors have stated and justified their rationale for using the mentioned evaluation metrics in the paper. However, in the general image registration space, metrics such as the Dice coefficient and Jaccard index are commonly used and accepted. Is there a particular reason why these were not used as well? It would offer a more complete picture for the general user if these metrics were provided as well. 5. The validation of registering distant neighboring sections is quite a valuable contribution, as the authors rightly stated that in many multi-omics experiments, this might be a necessity. However, when looking at tissue sections that are 80-100 microns apart, it is quite likely that the set of cells that one may be looking at on the x-y coordinate system may not be the same at all; in fact, for a highly heterogeneous/flexible piece of tissue, they might be completely different. In such a circumstance, how much value is there in registering these two sections together instead of, say, separately analyzing them and using alternative methods to combine the results downstream? 6. In the proof of concept presented in the paper, the authors mention using ST and MSI data for validating their framework. Have they also investigated ST integration with more commonly available datatypes such as IHC/mIF? 7. The work that the authors have put in to validate the registration and MIIT framework using different approaches (selecting spatially distant slides, integration using augmented/artificial data) is thorough. However, different tissue types bring in their own challenges, and thus validation of this framework on an external dataset would lend more credence to this much-needed framework, especially in the era of increased multi-omics analyses.
  
  Minor Revisions: 1. Please ensure all typos/grammatical mistakes are corrected. 2. In the 'preprocessing of stained histology images', can more details be given on the thresholding process? It is also stated that the threshold is manually adjusted for each image if necessary - how is this determination done? 3. The headings/subheadings within sections could be organized more clearly; in some parts it was challenging to determine the organization of sections/subsections. 4. Can some more details be given on the landmarks that were identified per image? Could some examples be provided of what these landmarks are, and how they remain consistent across tissue layers? 5. Currently, the way the paper lists the samples used for validating the GreedyFHist and MIIT frameworks is quite confusing. It would be appreciated if the authors could distinctly state how many samples, out of the full set, and which associated stained slides are used for each. 6. How were the annotations from the 3 annotators cross-validated?
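  Major revision 4 above refers to the Dice coefficient and the Jaccard index as overlap metrics for registration. For readers unfamiliar with them, a minimal sketch for two binary tissue masks follows. The nested-list mask representation and the function name are assumptions for illustration and are unrelated to the MIIT or GreedyFHist code.

```python
def dice_and_jaccard(mask_a, mask_b):
    """Dice coefficient and Jaccard index of two binary masks given as equal-shape nested lists."""
    a = {(r, c) for r, row in enumerate(mask_a) for c, v in enumerate(row) if v}
    b = {(r, c) for r, row in enumerate(mask_b) for c, v in enumerate(row) if v}
    if not a and not b:
        return 1.0, 1.0  # convention chosen here: two empty masks agree perfectly
    overlap = len(a & b)
    dice = 2 * overlap / (len(a) + len(b))
    jaccard = overlap / len(a | b)
    return dice, jaccard


fixed = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
warped = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
print(dice_and_jaccard(fixed, warped))
```

  The two metrics are monotonically related (Dice = 2J / (1 + J)), so reporting either one conveys the same ranking of registration results.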

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.06.11.598306v1
May 2025
www.biorxiv.org www.biorxiv.org

Chromosome-level genome assembly of the lemon sole Microstomus kitt (Pleuronectiformes: Pleuronectidae)

2
1. GigaScience 30 May 2025
  
  in GigaByte
  
  Editors Assessment:
  
  This Data Release paper presents the first genome assembly of the lemon sole (Microstomus kitt), a commercially important flatfish found in European coastal waters. It is also interesting that this work was carried out in a University course setting involving the students. The resulting chromosome-level genome was assembled using long-read PacBio HiFi sequencing and the Hi-C technique. The 628 Mbp reference (which is consistent with other Pleuronectidae fish species) is assembled into 24 chromosome-length scaffolds with high completeness, achieving a scaffold N50 of 27.2 Mbp. Peer review and data curation made the author clarify a few points and share all of the data and results in an open and well curated manner. The annotated genome of the lemon sole, with its high continuity, should therefore provide important reference data for future population genetic analyses and conservation strategies of this organism.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 30 May 2025
  
  in GigaByte
  
  Abstract. Background: The lemon sole (Microstomus kitt) is a culinary fish from the family of righteye flounders (Pleuronectidae) inhabiting sandy and shallow offshore grounds of the North Sea, the western Baltic Sea, the English Channel, the shallow waters of Great Britain and Ireland as well as the Bay of Biscay and the coastal waters of Norway. Findings: Here, we present the chromosome-level genome assembly of the lemon sole. We applied PacBio HiFi sequencing on the PacBio Revio system to generate a highly complete and contiguous reference genome. The resulting assembly has a contig N50 of 17.2 Mbp and a scaffold N50 of 27.2 Mbp. The total assembly length is 628 Mbp, of which 616 Mbp were scaffolded into 24 chromosome-length scaffolds. The identification of 99.7% complete BUSCO genes indicates a high assembly completeness. Conclusions: The chromosome-level genome assembly of the lemon sole provides a high-quality reference genome for future population genomic analyses of a commercially valuable edible fish.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.156), and has published the reviews under the same license.
  
  Reviewer 1. Alejandro Mechaly
  
  Are all data available and do they match the descriptions in the paper? No. The BioProject number is not included in the submitted manuscript.
  
  Are the data and metadata consistent with relevant minimum information or reporting standards? No. The BioProject number is not included in the submitted manuscript.
  
  Comments: The paper presents a valuable contribution to the genomics of Microstomus kitt (lemon sole), a commercially important species. The study introduces a chromosome-level genome assembly using PacBio HiFi sequencing, resulting in a highly contiguous assembly with 99.7% completeness in BUSCO genes. This high-quality genome will serve as a key resource for future population genomics and aquaculture studies. Overall, this assembly offers a solid foundation for advancing research on the biology and management of lemon sole. The main critique of this study is that, while it highlights the sexual dimorphism in lemon sole, where females are larger than males, it does not delve into this aspect in detail. Although the research presents valuable data through a high-quality chromosomal-level genome assembly, it focuses exclusively on male specimens. Comparing the genomes of both sexes would be highly insightful, potentially revealing the genetic mechanisms or pathways underlying this dimorphism through comparative genomics. Recent studies on flatfish (Villarreal et al., 2024. https://doi.org/10.1186/s12864-024-10081-z) have used comparative genomics to examine sex determination genes, and applying this approach to lemon sole would significantly enhance the study’s impact. Furthermore, there are numerous sequenced flatfish genomes that should be analyzed alongside these results to provide a more comprehensive context.
  
  Re-review: Thank you for addressing my comments. While I understand the study's limitations, including its focus as part of a university course and the use of a single specimen, I believe the manuscript lacks sufficient impact without exploring the genetic basis of sexual dimorphism or incorporating comparative analyses with other flatfish genomes. The genome assembly and annotation are well-executed, but the absence of biological context limits the broader relevance of the work. Sexual dimorphism in lemon sole, a commercially important species, is a key topic that could inform aquaculture and fisheries management. Without addressing this, the manuscript misses an opportunity to answer important scientific questions. For these reasons, I cannot recommend the manuscript for publication in its current form. While the technical work is solid, additional analyses or a broader scope are needed to enhance its contribution to the field.
  
  Reviewer 2. Yongshuang Xiao
  
  This MS presents the chromosome-level genome assembly of Microstomus kitt, a species belonging to the Pleuronectidae family and mainly distributed in the North European seas. The study utilized PacBio HiFi sequencing technology combined with Hi-C data for chromosome-level assembly, resulting in a high-quality reference genome of approximately 633 Mb, including 23 chromosome-length scaffolds, completing 99.7% of BUSCO genes, demonstrating high assembly completeness and gene annotation quality. Further analysis revealed abundant repetitive sequences and gene features in the lemon sole genome, providing important resources for future genetic studies of this species and its close relatives. The paper presents several issues as follows: 1. From the evaluation of the genome, the estimated size is around 542 Mb, while the manually curated Hi-C results yielded a genome size of 633 Mb. The authors are requested to explain why there is a difference of nearly 100 Mb between the second-generation sequencing evaluation and the third-generation results. 2. Utilizing PacBio HiFi sequencing technology, which generates long reads, and its associated assembly software, the authors were able to assemble the genome at the chromosome level. The authors explicitly state that the size of the 23 chromosome-level scaffolds assembled using YaHS and Chromap software is around 500 Mb, which is consistent with the genome survey results. How do the authors know that the assembled genome is erroneous? 3. Based on the authors' description, it is not clear what the size of the assembled genome from a single chain using PacBio sequencing is. The authors need to provide this data in the results. 4. The authors performed quality assessments of the assembled genome using various methods such as Merqury. However, the description of the evaluation results is lacking. 
The authors are requested to include the QV evaluation values and additional results of SNP alignment for the second-generation sequencing data. 5. For gene annotation, the authors used the genomes of five species of Pleuronectidae as references. We are eager to see the results of the alignment analysis between the genome obtained using PacBio Revio and the aforementioned five fish genomes. Although these results do not need to be included in the main text, they should be provided as part of the response to the reviewers, including the alignment results and alignment rates for both sets of assembled genomes (500 Mb and 633 Mb). 6. The authors are requested to include the length information of each chromosome in the supplementary files. From the assembly results, it appears that the PacBio Revio results are not as impressive as anticipated, particularly with a Scaffold N50 of 29.4 Mbp. Is this due to limitations in the length of the chromosomes themselves, affecting the quality metrics of this genome? 7. The data should be uploaded to NCBI and obtain the corresponding registration code.
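Point 6 above asks whether the scaffold N50 is limited by the chromosome lengths themselves. The definition of N50 makes this easy to check once per-chromosome lengths are available; the following is a minimal sketch of the standard calculation, with toy lengths that are purely illustrative and not taken from the lemon sole assembly.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequence lengths given")


# Toy example: five scaffolds with made-up lengths in Mbp.
print(n50([10, 8, 6, 4, 2]))
```

In a chromosome-level assembly the scaffold N50 cannot exceed the length of the longest chromosome, so a genome split into many similarly sized chromosomes will show a modest N50 regardless of the sequencing platform.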
  
  Re-review: This study aims to perform chromosome-level genome assembly of the lemon sole (Microstomus kitt) and conduct a comprehensive analysis of its genome using high-throughput sequencing technology. Researchers utilized PacBio HiFi sequencing technology to carry out whole-genome sequencing of this species, resulting in a high-quality and complete genome sequence. The genome sequence has a length of 633 Mbp, with 23 chromosome-level sequences successfully assembled. Additionally, BUSCO analysis indicated that this genome sequence possesses a high level of completeness. These results suggest that the lemon sole genome sequence can serve as an important reference for future population genetic studies of commercially valuable edible fish species. However, there are certain issues with the paper that need to be addressed:

  The authors emphasize that female lemon soles grow larger than males, yet they chose to sequence the male genome instead of focusing on the more unique female. The authors should clarify this choice.

  The Hi-C assisted assembly results show that male lemon soles have 23 chromosome pairs. Are there any heteromorphic chromosomes? The authors need to elucidate the karyotype of the lemon sole, as this information is significant for both the genome assembly and subsequent research.

  The survey results indicate a high level of heterozygosity in lemon sole. How did the authors account for this high heterozygosity to obtain a relatively complete genome? Could this affect the accuracy of the genome?

  Although the authors achieved high-quality genome results through PacBio sequencing, they used BUSCO for genome quality assessment. To further highlight the completeness and accuracy of the assembled genome, it is recommended that the authors utilize QV for additional evaluation.

  To ensure high levels of data sharing and reproducibility, the authors are requested to provide the chromosome-level genome fasta file and gff annotation file.

  In summary, the authors are encouraged to provide additional information and make necessary revisions.

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.04.29.651060v1
www.biorxiv.org www.biorxiv.org

Chromosome-level genome assemblies of five Sinocyclocheilus species

2
1. GigaScience 14 May 2025
  
  in GigaByte
  
  **Editors Assessment:** Sinocyclocheilus is a genus of freshwater cavefish endemic to the karst regions of Southwest China. Their diverse traits in morphology, behavior, and physiology, typical of cavefish, make them interesting models for studying cave adaptation and phylogenetic evolution. The manuscript assembled chromosome-level genomes of five Sinocyclocheilus species and conducted an allotetraploid origin analysis on these species. For S. grahami (the golden-line barbel), PacBio and Hi-C sequencing yielded a final chromosome-level genome assembly of 1.6 Gb, with a contig N50 of 738.5 kb and a scaffold N50 of 30.7 Mb, and with 93.1% of the assembled genome sequences and 93.8% of the predicted genes anchored onto 48 chromosomes. The authors then used homologous comparison to obtain chromosome-level genome assemblies for four other Sinocyclocheilus species: S. maitianheensis, S. rhinocerous, S. anshuiensis, and S. anophthalmus, with over 82% of the genome sequences anchored onto the constructed chromosomes. Peer review provided clarification on the assembly strategy and added more benchmarking. These data have the potential to contribute to species conservation and to the exploitation of the economic and ecological value of diverse Sinocyclocheilus members.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 14 May 2025
  
  in GigaByte
  
  ABSTRACT Sinocyclocheilus, a genus of tetraploid fishes, is endemic to the karst regions of Southwest China. All species within this genus are classified as second-class national protected species due to their unique and fragile habitat. However, absence of high-quality genomic resources has hindered various research efforts to elucidate their phylogenetic relationships and the origin of polyploidy. To address these academic challenges, we at first constructed a high-quality genome assembly for the most abundant representative, golden-line barbel (Sinocyclocheilus grahami), by integration of PacBio long-read and Hi-C sequencing technologies. The final scaffold-level genome assembly of S. grahami is 1.6 Gb in length, with a scaffold N50 up to 30.7 Mb. A total of 42,205 protein-coding genes were annotated. Subsequently, 93.1% of the assembled genome sequences (about 1.5 Gb) and 93.8% of the total predicted genes were successfully anchored onto 48 chromosomes. Furthermore, we obtained chromosome-level genome assemblies for four other Sinocyclocheilus species (including S. anophthalmus, S. maitianheensis, S. anshuiensis, and S. rhinocerous) based on homologous comparison. These genomic data we present in this study provide valuable genetic resources for in-depth investigation on cave adaptation and improvement of economic values and conservation of diverse Sinocyclocheilus fishes.
  
  Reviewer 1. Jun Wang
  
  The manuscript assembled chromosomal-level genomes of five Sinocyclocheilus species, and conducted allotetraploid origin analysis on these species. The manuscript was meaningful and provided valuable genome resources in Sinocyclocheilus genus, which will further help with the evolution and functional genomics of these species. The analysis was accurate, and the results were solid. My comments are as follows
  
  Please detail the method how you assembled four other species on homologous comparison? You just map the assembled scaffold to the reference genome?
  
  In the manuscript, the author only provide the sequencing info of S. grahami but not the other four species. What are the sequencing information of other four species, like how many reads have been sequenced with Illumina?
  
  There was no results description for figure 2 and why there are there only repeat annotation results for S. grahami and not the other four species?
  
  Reviewer 2. Fei Li and Shili Li
  
  This paper entitled “Chromosome-level genome assemblies of five Sinocyclocheilus species” reported a chromosome-level golden-line barbel genome using a combination of PacBio and Hi-C data. Using this chromosome-level assembly as a reference, the authors also constructed four other pseudo-chromosome-level assemblies, of S. anophthalmus, S. maitianheensis, S. anshuiensis, and S. rhinocerous. These data are a really important resource for the conservation of these endangered species. However, some important results have not been shown: 1. The protein BUSCO result has not been shown. 2. Raw reads were not uploaded to NCBI. 3. What are the detailed numbers for functional annotation?
  
  Some minor suggestions: Add “,” before “and conservation”. What’s the meaning of “R & D”? Line 58, “a good model” should be “good models”. Line 64, remove “at first”. Line 84, change “a” to “the”. Line 90, change ‘muscle’ to “muscle tissue”. Line 105, remove ‘which was’. Line 112, remove ‘this study’. Line 122, change “Repeat annotation, gene prediction, and function prediction” to “Annotation of repeat, gene and function”. Line 137, ‘with’ should be ‘by using’. Line 127, remove ‘(TEs)’. Line 134, What’s meaning of NCBI GenBank? Remove GenBank. Line 140, ‘was’ should be ‘were’. Line 178, ‘Species’ should be ‘species’.

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.02.27.640546v1
www.biorxiv.org www.biorxiv.org

RWRtoolkit: multi-omic network analysis using random walks on multiplex networks in any species

1
1. GigaScience 06 May 2025
  
  in GigaScience
  
  Leveraging the use of multiplex multi-omic networks, key insights into genetic and epigenetic mechanisms supporting biofuel production have been uncovered. Here, we introduce RWRtoolkit, a multiplex generation, exploration, and statistical package built for R and command line users. RWRtoolkit enables the efficient exploration of large and highly complex biological networks generated from custom experimental data and/or from publicly available datasets, and is species agnostic. A range of functions can be used to find topological
  
  Reviewer name: Francis Agamah

  Reviewer Comments: The paper introduces a species-agnostic random walk with restart toolkit built for R and command line users. The tool enables construction of multiplex networks from any set of data layers and enables the discovery of gene-to-gene relationships. The tool offers a collection of functions for network analysis. Overall, the tool is a significant contribution to network analysis.

  Major Comments

  The manuscript's background section should provide a more comprehensive overview of the rationale behind the development of RWRtoolkit. It should clearly outline the existing RWR implementation tools, identify the gaps in these tools, and explain how RWRtoolkit addresses these limitations or offers a new approach.

  To demonstrate the effectiveness of RWRtoolkit, the authors could evaluate the ranking performance against other established random walk with restart algorithms that can handle heterogeneous multiplex networks. Additionally, a detailed explanation of the scoring approach implemented in RWRtoolkit is necessary to justify its choice and potential advantages.

  The authors have indicated in the section "network layer and multiplex statistics" that the tau parameter affects the probability of the walker visiting each specific layer. To address potential bias issues in the network exploration, it would be beneficial to provide an exploration of the parameter space and indicate how it informs the stability of the RWR output scores under variations of the various algorithm parameters.
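  The parameter-stability check requested above can be illustrated with a minimal sketch. It is not RWRtoolkit's implementation: it runs a plain random walk with restart on a single-layer toy graph in pure Python and sweeps only the restart probability, whereas the tau parameter discussed by the reviewer governs layer visiting probabilities in a multiplex network. The function name, graph, and parameter values are all hypothetical.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until p stops changing.

    adjacency: square list-of-lists of non-negative edge weights.
    seeds: set of node indices where the walker restarts.
    """
    n = len(adjacency)
    # Column sums turn raw weights into transition probabilities out of each node.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        new_p = []
        for i in range(n):
            walk = sum(adjacency[i][j] / col_sums[j] * p[j]
                       for j in range(n) if col_sums[j] > 0)
            new_p.append((1 - restart) * walk + restart * p0[i])
        # Stop once the L1 change between iterations is negligible.
        if sum(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Hypothetical 4-node path graph 0-1-2-3 with node 0 as the only seed.
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

# Sweep the restart probability and print the node ranking at each setting.
for r in (0.1, 0.5, 0.9):
    scores = random_walk_with_restart(path, {0}, restart=r)
    ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
    print(r, ranking, [round(s, 4) for s in scores])
```

  Comparing the rankings across the sweep is one simple way to report stability: at low restart values node degree tends to dominate the scores, so rankings may shift relative to high restart values.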

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.07.17.603975v1
www.biorxiv.org www.biorxiv.org

Defining the limits of plant chemical space: challenges and estimations

2
1. GigaScience 06 May 2025
  
  in GigaScience
  
  model, a de novo prediction model, a combination of library search and de novo prediction, and MS2 clustering—to estimate the number of unique structures. Our methods suggest that the number of unique compounds in the metabolomics dataset alone may already surpass existing estimates of plant chemical diversity. Finally, we project these findings across the entire plant kingdom, conservatively estimating that the total plant chemical space likely spans millions, if not more, with the vast majority still unexplored.
  
  Reviewer name: Kohulan Rajan

  Reviewer Comments: Review: Defining the limits of plant chemical space: challenges and estimations

  This work presents an important contribution to understanding the chemical diversity of plants through a systematic analysis combining metabolomics data and literature mining. The authors address a question in the field and employ multiple complementary approaches to estimate the size of the plant chemical space. Here are my few suggestions and questions for the authors to clarify:

  1. When introducing an abbreviation, one could use capital letters: "Natural Products (NP)".

  2. There is no list of abbreviations in the document, so introduce them first and then use them. There may be some readers who are unfamiliar with the terms COCONUT and LOTUS.

  3. Is there any prior work using similar combined metabolomics/literature approaches to estimate plant chemical space? If so, these should be cited. If not, please state this explicitly to highlight the novelty of your method.

  4. Cite SMILES.

  5. While the paper describes the use of 'literature datasets,' it appears that only existing databases (COCONUT and LOTUS) are being utilized. It would be helpful if authors could clarify whether any direct literature mining was conducted. If not, consider revising terminology to more accurately reflect the use of curated databases rather than primary literature sources.

  6. Great to see the data and code openly shared on both Zenodo and GitHub. I also find the GitHub repository very useful with regard to all the provided notebooks. To maximize reusability, please consider adding a detailed "How to Use" section to the README that guides others in replicating or building upon this work.

  7. The different clustering thresholds (0.7 vs 0.8) lead to notably different estimates. Could you discuss which threshold might be more appropriate for this specific application to plant metabolomics data?
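  The sensitivity to the clustering threshold raised in comment 7 can be seen in a toy sketch. The review does not describe the manuscript's clustering method, so the example below assumes single-linkage grouping on cosine similarity between made-up intensity vectors; the function names and data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def count_clusters(vectors, threshold):
    """Count single-linkage clusters: link any pair with similarity >= threshold."""
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(vectors))})


# Made-up "spectra": the middle vector is about 0.707 similar to each neighbour.
toy_spectra = [[1, 0], [1, 1], [0, 1]]

for threshold in (0.7, 0.8):
    print(threshold, count_clusters(toy_spectra, threshold))
```

  In this toy case the three vectors merge into one cluster at 0.7 and stay separate at 0.8, which is the kind of threshold dependence a discussion in the manuscript would need to address.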
2. GigaScience 06 May 2025
  
  in GigaScience
  
  The plant kingdom, encompassing nearly 400,000 known species, produces an immense diversity of metabolites, including primary compounds essential for survival and secondary metabolites specialized for ecological interactions. These metabolites constitute a vast and complex phytochemical space with significant potential applications in medicine, agriculture, and biotechnology. However, much of this chemical diversity remains unexplored, as only a fraction of plant species have been studied comprehensively. In this work, we estimate the size of the plant chemical space by leveraging large-scale metabolomics and literature datasets. We begin by examining the known chemical space, which, while containing at most several hundred thousand unique compounds, remains sparsely covered. Using data from over 1,000 plant species, we apply various mass spectrometry-based approaches—a formula prediction
  
  Reviewer name: Carlos Rodríguez-López

  Reviewer Comments: In the reviewed manuscript, Chloe Engler Hart et al. utilize different approaches to estimate the size of plant chemical space through analysis of publicly available datasets of mass spectrometry-based metabolomics. The authors tackle this issue by using data from ca. 2,000 LC-MS runs, and different formula predictors and structure annotation algorithms, and extrapolate to the estimated number of plant species. While the approach is useful at estimating structural variation, and the collected data and here-published source code can certainly be of use to the plant metabolomics community, I consider the manuscript requires modifications before it can be recommended for publication.

  Particularly, the language of the article should more accurately reflect the nature of this estimate; for example, mentions of the approach being "the most accurate estimate possible" (p.8, section 3.2) are not supported, and throughout the article, mentions of the calculation as a "conservative estimate" are not consistent with the approaches used, beyond formula prediction. E.g. it is mentioned that the MS2 curve being lower than formula prediction suggests that the curves may be conservative without further clarification on why this might be the case and not, e.g., a product of estimates dispersion. The authors mention that since they identify most limitations (in table 2, p. 13) are underestimations (again, with limited or no explanation) their estimate is conservative. Since no effect size can be calculated on these limitations, this statement is not true; e.g. if the approach is missing half of molecules due to extraction, and another half due to tissue coverage (total, ¼), but overestimating the plateau of plant chemical diversity by 100-fold, even if more factors underestimate the chemical space, the effect size of the latter would be dominant by far.
I recommend the authors to change mentions of this estimate being a conservative approach, and instead clearly mention that this is a fragmentation-based estimate, or a similar term that better reflects the nature of the figure. Similarly, assumptions on the models should be explicitly stated, along with their limitations. The authors, for example, rely on CID induced fragmentation, and they mention that the estimate "[relies] on the predominant adduct ([M+H]+)" (p.15) and thus "this likely underestimates the true chemical diversity, as other adduct forms" (p.15). It should be stated that this is an assumption: the authors do not have evidence for the adducts being [M+H]+, which is nigh impossible with the available data, they are assuming all features are [M+H]+ adducts. This carries the implicit assumption that fragmentation mechanisms will be the same for all MS2 spectra and thus structural diversity can be estimated through MS2 clusters. It is unclear how this would yield an underestimation, as the authors claim, but rather yields an overestimation, as fragmentation of [M+H]+ and e.g. [M+Na]+ adducts of the same molecule would yield different fragmentation patterns, given the former favors charge migration dependent mechanisms compared to the latter. Thus, since the authors consider all features to be [M+H]+, two adducts of the same molecule might be considered as different moieties, given that fragmentation patterns will differ, even if no difference exists. On the same vein, since similarity thresholds of the MS2Mol algorithm are essential for the estimation of diversity, the authors should clearly state how are they calculated in text, not by reference, along with potential limitations. Finally, I believe the work would greatly benefit from including data on phylogenetics of the samples, adding diversity estimates to their sample and extrapolation data. 
If, for example, most of the 400,000 plant species are phylogenetically distant from the sampled species, then the reader can reasonably assume that this might be an underestimation of chemical diversity when presented with the evidence. If, on the other hand, the original sample has more diversity than the total number of plant species, this might not be the case. In any case, all of the relevant assumptions should be clearly stated. Minor note: One of the main arguments for extrapolating the diversity estimate into the rest of the plants comes from Figure 3D, where increasing MS1 adducts increases with number of samples; it would greatly help explaining the difference seen between species if the authors clarify the tissues sampled per species. E.g. if the species that only doubles the number of features contains only aerial and vegetative tissue, compared to the species that increases 6fold which might include root or reproductive tissue, etc. This might also help the authors in justifying the extrapolation of the estimate.
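  The effect-size argument in this review (two halving losses against a 100-fold overestimate) is simple arithmetic, sketched below. The sketch assumes the biases are independent and combine multiplicatively, which is an assumption made here for illustration; the factor names and values are the reviewer's hypothetical, not measured quantities.

```python
def net_bias(factors):
    """Multiply multiplicative bias factors; a result above 1 means net overestimation."""
    product = 1.0
    for factor in factors:
        product *= factor
    return product


# The reviewer's hypothetical: two sources of underestimation, one of overestimation.
hypothetical_factors = {
    "extraction loss": 0.5,
    "tissue coverage loss": 0.5,
    "plateau overestimate": 100.0,
}

print(net_bias(hypothetical_factors.values()))
```

  Under these assumptions the net factor is 25, so counting how many limitations point downward says little about the direction of the overall estimate without their magnitudes.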

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.01.08.631938v1
Apr 2025
www.biorxiv.org www.biorxiv.org

Efficiently Constructing Complete Genomes with CycloneSEQ to Fill Gaps in Bacterial Draft Assemblies

3
1. GigaScience 28 Apr 2025
  
  in GigaByte
  
  Editors Assessment:
  
  With the recent official launch of BGI’s new CycloneSEQ sequencing platform, which delivers long reads using novel nanopores, this paper presents benchmarking data and validation studies comparing short- and long-read data from other platforms and hybrid assemblies. This study tests the performance of the new platform in sequencing diverse microbial genomes, presenting raw and processed data to enable others to scrutinise and verify the work. Being openly peer-reviewed, and having scripts and protocols also shared for the first time, helps provide transparency in this benchmarking process to increase trust in this new technology. On top of benchmarking type strains, the technology was also tested with complex microbial communities, yielding complete metagenome-assembled genomes (MAGs) that were not achieved by short- or long-read assemblies alone. By directly reading DNA molecules without fragmentation, the study demonstrates that CycloneSEQ delivers long-read data with impressive length and accuracy, filling gaps that short-read technologies alone cannot bridge. Future work is expanding to real samples and fine-tuning the balance between short-read and long-read data for even faster, higher-quality assemblies.
  
  This evaluation refers to version 1 of the preprint
  
  Summary
2. GigaScience 28 Apr 2025
  
  in GigaByte
  
  Competing Interest Statement: The CycloneSEQ was developed by BGI-Research and will be marketed as an advanced technology. All the authors are employees of BGI-Research and may potentially benefit from it.
  
  See also the Ryan Wick Blog reviewing the preprint: https://rrwick.github.io/2024/12/17/cycloneseq.html
3. GigaScience 28 Apr 2025
  
  in GigaByte
  
  Abstract

  Background: Current microbial sequencing relies on short-read platforms like Illumina and DNBSEQ, favored for their low cost and high accuracy. However, these methods often produce fragmented draft genomes, hindering comprehensive bacterial function analysis. CycloneSEQ, a novel long-read sequencing platform developed by BGI-Research, its sequencing performance and assembly improvements has been evaluated.

  Findings: Using CycloneSEQ long-read sequencing, the type strain produced long reads with an average length of 11.6 kbp and an average quality score of 14.4. After hybrid assembly with short reads data, the assembled genome exhibited an error rate of only 0.04 mismatches and 0.08 indels per 100 kbp compared to the reference genome. This method was validated across 9 diverse species, successfully assembling complete circular genomes. Hybrid assembly significantly enhances genome completeness by using long reads to fill gaps and accurately assemble multi-copy rRNA genes, which unable be achieved by short reads solely. Through data subsampling, we found that over 500 Mbp of short-read data combined with 100 Mbp of long-read data can result in a high-quality circular assembly. Additionally, using CycloneSEQ long reads effectively improves the assembly of circular complete genomes from mixed microbial communities.

  Conclusions: CycloneSEQ’s read length is sufficient for circular bacterial genomes, but its base quality needs improvement. Integrating DNBSEQ short reads improved accuracy, resulting in complete and accurate assemblies. This efficient approach can be widely applied in microbial sequencing.
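  To put the reported average quality score of 14.4 in context, the standard Phred relation Q = -10 log10(p) converts a quality score to a per-base error probability. The sketch below applies it; it assumes the reported value is a Phred-scaled score, and it does not account for how the average was computed (a mean of per-read Q values and the Q of the mean error rate generally differ).

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert a per-base error probability to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# Quality score taken from the abstract; Q20 and Q30 are added for comparison.
for q in (14.4, 20, 30):
    print(q, round(phred_to_error(q), 5))
```

  Under that assumption, Q14.4 corresponds to an error probability of roughly 3.6% per base, compared with 1% at Q20.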
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.154), and has published the reviews under the same license.
  
  Reviewer 1. Ryan Wick
  
  This manuscript introduces CycloneSEQ data as a means for producing complete bacterial genome assemblies, with a focus on hybrid assemblies made using a combination of CycloneSEQ data and DNBSEQ data. It also publicly provides deep CycloneSEQ+DNBSEQ read sets for a range of bacterial species.
  
  Major comments
  
  The reads for the project were made publicly available via CNGBdb (https://db.cngb.org/search/project/CNP0006129), but I found it to be unusably slow (both the HTTP website and the FTP data downloads). To ensure the data is accessible to a wide audience, I request that it also be hosted in another location to make it available to readers. For example, SRA, ENA or GigaDB.
  
  The paper makes no mention of the other major long-read platforms: Oxford Nanopore Technologies and Pacific Biosciences. Given the widespread use of these platforms (especially ONT) in bacterial genome assembly, some discussion on CycloneSEQ’s relative advantages or limitations would be beneficial.
  
  Minor comments
  
  Lines 100-103: this sentence (‘The GC content was sensitively affected…’) is not clear to me. How are the completeness and accuracy of the assembly affecting GC content?
  
  Figure S2 unnecessarily includes reference-vs-reference difference counts, which are by definition zero.
  
  Figure S2 could mention the genome (Akkermansia muciniphila ATCC BAA-835) in the caption – I did not immediately understand what 'for type strain' meant.
  
  I found Figure 5 difficult to read, with its use of colour to indicate accuracy. This data would be better shown using another visualisation (e.g. bar plot) that more clearly shows quantitative values.
  
  For the mixed microbial community analysis, it should be stated that Unicycler is exclusively designed for bacterial isolates (its documentation explicitly says to not use it on metagenomes).
  
  Some of the supplementary figures are erroneously labelled 'Supplementary Table'.
  
  Some stats on the metagenomic reads would be helpful: e.g. total bp for short and long reads, N50 for long reads, etc.
  
  The methods describe using seqtk, but the reference for this (#25) is SeqKit (a different tool), so either the tool in the methods or the reference is wrong. Re-review: Thank you for the revisions to the manuscript. While many of my minor comments have been addressed, I still have concerns regarding my major comments, which have not been fully resolved.
  
  First, I appreciate that the data has now been made available on NCBI. However, the long-read datasets are labelled as Oxford Nanopore MinION data, which is misleading (example: SRR31850034). I understand this may be because SRA does not yet provide CycloneSEQ as a platform option, but this can be clarified through additional metadata. Specifically, the ‘design’ field for each SRA entry simply says ‘genome’, but it could have more detail, including that these are CycloneSEQ reads. The BioProject (PRJNA1194773) description could also include a clear statement that the long-read data is generated using CycloneSEQ.
  
  Second, I had requested a brief discussion of existing long-read platforms (ONT and PacBio) to provide context on where CycloneSEQ fits into the broader sequencing landscape. The authors have chosen not to include this, stating that they do not have direct comparison data. While I understand that such a comparison is not the purpose of this paper, I still believe that some mention of these platforms is necessary in the Background and/or Discussion sections. This paper introduces a new long-read technology for bacterial genome assembly, and readers will naturally want to understand how it relates to widely used alternatives.
  
  Finally, regarding my comment about supplementary figure labels, I still see the issue in the revised version provided for review. For example, the caption for Supplementary Figure S3 begins with ‘Supplementary Table S3.’ The authors stated that there were no errors, but this mislabelling remains in the PDF I received.
  
  As these concerns remain unresolved, I do not consider the manuscript acceptable in its current form.
  
  Reviewer 2. Keith Robison
  
  As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?
  
  N/A - no software presented (relates to other software questions)
  
  Additional comments: This is a useful presentation of an emerging sequencing platform.
  
  Given the complex nature of nanopore signals and the difficulty of decoding them, it has been a pattern with the prior nanopore platform that improvements in basecalling software have yielded significant changes in basecalling performance. Therefore, it would be highly advantageous if the manuscript listed which specific versions / revision numbers of the basecalling software were used so that these results are properly contextualized for comparison to future results which may use newer basecalling software.
  
  Ideally, the publication would include a link to git (or similar) repository with the complete pipeline used to generate the results

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.09.05.611410v1
www.biorxiv.org www.biorxiv.org

Network-based anomaly detection algorithm reveals proteins with major roles in human tissues

2
1. GigaScience 28 Apr 2025
  
  in GigaScience
  
  Background: Anomaly detection in graphs is critical in various domains, notably in medicine and biology, where anomalies often encapsulate pivotal information. Here, we focused on network analysis of molecular interactions between proteins, which is commonly used to study and infer the impact of proteins on health and disease. In such a network, an anomalous protein might indicate its impact on the organism’s health.

  Results: We propose Weighted Graph Anomalous Node Detection (WGAND), a novel machine learning-based method for detecting anomalies in weighted graphs. WGAND is based on the observation that edge patterns of anomalous nodes tend to deviate significantly from expected patterns. We quantified these deviations to generate features, and utilized the resulting features to model the anomaly of nodes, resulting in node anomaly scores. We created four variants of the WGAND methods and compared them to two previously-published (baseline) methods. We evaluated WGAND on data of protein interactions in 17 human tissues, where anomalous nodes corresponded to proteins with major roles in tissue contexts. In 13 of the tissues, WGAND obtained higher AUC and P@K than baseline methods. We demonstrate that WGAND effectively identified proteins that participate in tissue-specific processes and diseases.

  Conclusion: We present WGAND, a new approach to anomaly detection in weighted graphs. Our results underscore its capability to highlight critical proteins within protein-protein interaction networks. WGAND holds the promise to enhance our understanding of intricate biological processes and might pave the way for novel therapeutic strategies targeting tissue-specific diseases. Its versatility ensures its applicability across diverse weighted graphs, making it a robust tool for detecting anomalous nodes.

  Competing Interest Statement: The authors have declared no competing interest.
  
  Reviewer 2. Dan Shao
  
  This manuscript provides an approach to highlight critical proteins within protein-protein interaction networks by Weighted Graph Anomalous Node Detection (WGAND). I see a lot of serious issues, as follows.
  
  Overall, the author submitted the article to GigaScience, so the problem he needs to solve should be the protein-disease relationship rather than anomaly detection in graphs. However, from the Abstract to the Introduction, the article always introduces the methods and applications of anomaly detection.
  
  Also, the logic of the whole article is confusing. There is a repetition of the specific method design in Methods (2.1 and 2.2). The overall program lacks method diagrams or flowcharts for explanation. In addition, the results should be in Results and not in Methods.
  
  The results do not go to the significant achievements and cannot fully reflect the superiority of the methods.
  
  Conclusion is missing from the text.

  The use of the English language is very awkward at times.
  
  The font in some panels of some Figures (e.g., 6) is way too small.
  
  Re-review: Comments to the Authors

  The manuscript "Network-based anomaly detection algorithm reveals proteins with major roles in human tissues" triggered a positive initial impression, regarding abstract, introduction and figures, but going deeper, I see a lot of serious issues, as follows.
  
  Methods and Results are very hard to read at times. In many cases, where tools or parameters are used without further justification, the impression is given that various choices were tried extensively until some setup gave plausible results.

  In this study, the authors treated an anomaly as a node that behaves differently from most of the nodes in the network. However, the basis for this assumption requires further substantiation. The authors' research is fundamentally rooted in this premise, yet it is not adequately verified in the article.

  In the evaluation, the authors employed non-standard parameters to validate the effectiveness of the model. For example, they used the value of 24% associated with Mendelian disease among the top 10 proteins calculated by WGAND to compare with results obtained from other models. However, is this method of comparison credible?

  Results contain a lot of details that I would expect to be part of Methods. Details of the model are missing in Methods.

  The use of the English language is very awkward at times.

  Minor, nice to have
  
  The font in some panels of some Figures (e.g., 2) is way too small.
  
  If a Figure consists of more than one part, e.g. A part, B part, each part should be explained separately.
  
  In the explanatory part of Figure 5, (a) (b) ... should be replaced by (A) (B) .... to maintain consistency with the figure.
2. GigaScience 28 Apr 2025
  
  in GigaScience
  
  Abstract
  
  Background: Anomaly detection in graphs is critical in various domains, notably in medicine and biology, where anomalies often encapsulate pivotal information. Here, we focused on network analysis of molecular interactions between proteins, which is commonly used to study and infer the impact of proteins on health and disease. In such a network, an anomalous protein might indicate its impact on the organism’s health.
  
  Results: We propose Weighted Graph Anomalous Node Detection (WGAND), a novel machine learning-based method for detecting anomalies in weighted graphs. WGAND is based on the observation that edge patterns of anomalous nodes tend to deviate significantly from expected patterns. We quantified these deviations to generate features, and utilized the resulting features to model the anomaly of nodes, resulting in node anomaly scores. We created four variants of the WGAND methods and compared them to two previously-published (baseline) methods. We evaluated WGAND on data of protein interactions in 17 human tissues, where anomalous nodes corresponded to proteins with major roles in tissue contexts. In 13 of the tissues, WGAND obtained higher AUC and P@K than baseline methods. We demonstrate that WGAND effectively identified proteins that participate in tissue-specific processes and diseases.
  
  Conclusion: We present WGAND, a new approach to anomaly detection in weighted graphs. Our results underscore its capability to highlight critical proteins within protein-protein interaction networks. WGAND holds the promise to enhance our understanding of intricate biological processes and might pave the way for novel therapeutic strategies targeting tissue-specific diseases. Its versatility ensures its applicability across diverse weighted graphs, making it a robust tool for detecting anomalous nodes.
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf034), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 1. Yong Zhang
  
  This study introduces the WGAND method, an innovative weighted graph anomaly detection algorithm to identify key anomalous proteins in human tissues using machine learning techniques. Given the critical role of abnormal proteins in disease prediction and treatment, this research area is pivotal for understanding complex systems' dynamic behaviors, especially in bioinformatics. In general, this article contributes to weighted graph anomaly detection. While this study provides valuable insights and demonstrates the WGAND method's good performance and practicality, here are some suggestions and potential directions for improvement:
  
  Building on existing research, conducting a detailed performance comparison analysis between the WGAND algorithm and similar cutting-edge methods (such as OddBall, Yagada, etc.) is recommended, explicitly highlighting WGAND's advantages in anomaly detection accuracy. A series of standard metrics should be used, including but not limited to precision, recall, F1 score, and AUC curve, to quantify WGAND's effectiveness and superiority rigorously.
  
  While AUC and P@K are valuable as main evaluation metrics, introducing additional metrics such as recall, precision, and F1 score for anomaly detection tasks can provide a more comprehensive assessment of model performance.
  
  Delve into optimizing the selection of node embedding methods and edge weight estimators based on different application scenarios and explore more systematic model selection and hyperparameter optimization strategies.
  
  Investigate strategies for dynamically setting thresholds to allow the WGAND method to adapt to changes in the data environment and various task demands.
  
  Discuss the applicability of WGAND across different types of weighted graphs (such as undirected and directed graphs) and assess its generality and adaptability.
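The metrics named in the suggestions above (precision, recall, F1, AUC, and P@K) can all be computed from a vector of node anomaly scores and binary ground-truth labels. The following is a minimal sketch in plain Python, not the WGAND authors' evaluation code; the function names, the toy scores, and the 0.5 threshold are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of true anomalies among the k highest-scoring nodes."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def precision_recall_f1(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float, float]:
    """Precision, recall and F1 when nodes scoring >= threshold are flagged."""
    predicted: List[int] = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random anomaly outranks a random normal node.

    Ties between an anomaly and a normal node count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one anomaly and one normal node")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six nodes, three of which are true anomalies.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]
toy_labels = [1, 0, 1, 0, 0, 1]
```

P@K and AUC depend only on the ranking of the scores, whereas precision, recall, and F1 require a threshold, which is one reason the choice of threshold raised in the suggestions matters for reporting.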

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.12.19.572354v1
www.biorxiv.org

Genome assembly and annotation of Acropora pulchra from Mo’orea, French Polynesia

2
1. GigaScience 13 Apr 2025
  
  in GigaByte
  
  Editors Assessment:
  
  Acropora pulchra is a species of small-polyped stony coral in the family Acroporidae from the Indo-Pacific. This Data Release is the first study in stony corals to present the DNA methylome in tandem with a high-quality genome assembled using PacBio long-read HiFi sequencing of an A. pulchra specimen from Mo’orea, French Polynesia. DNA methylation data were also called and quantified from this single-molecule sequencing data, and additional short-read Illumina RNASeq data were used for gene annotation. This produced an assembly of 518 Mbp, with 174 scaffolds, a scaffold N50 of 17 Mbp, and 40,518 protein-coding genes called. Peer review requested some improved benchmarking, and it is impressive to see from the results that the genome assembly represents the most complete and contiguous stony coral genome assembly to date. As A. pulchra is an important indicator species, these data will hopefully serve as a resource to the coral and wider scientific community. Further quantification of genome-wide methylation is needed to aid the study of epigenetics in non-model organisms, and specifically future analyses of methylation in coral.
  
  *This evaluation refers to version 1 of the preprint*
  
  Summary
2. GigaScience 13 Apr 2025
  
  in GigaByte
  
  Abstract: Reef-building corals are integral ecosystem engineers in tropical coral reefs worldwide but are increasingly threatened by climate change and rising ocean temperatures. Consequently, there is an urgency to identify genetic, epigenetic, and environmental factors, and how they interact, for species acclimatization and adaptation. The availability of genomic resources is essential for understanding the biology of these organisms and informing future research needs for management and conservation. The highly diverse coral genus Acropora boasts the largest number of high-quality coral genomes, but these remain limited to a few geographic regions and highly studied species. Here we present the assembly and annotation of the genome and DNA methylome of Acropora pulchra from Mo’orea, French Polynesia. The genome assembly was created from a combination of long-read PacBio HiFi data, from which DNA methylation data were also called and quantified, and additional Illumina RNASeq data for ab initio gene predictions. The work presented here resulted in the most complete Acropora genome to date, with a BUSCO completeness of 96.7% metazoan genes. The assembly size is 518 Mbp, with 174 scaffolds, and a scaffold N50 of 17 Mbp. Structural and functional annotation resulted in the prediction of a total of 40,518 protein-coding genes, and 16.74% of the genome in repeats. DNA methylation in the CpG context was 14.6% and predominantly found in flanking and gene body regions (61.7%). This reference assembly of the A. pulchra genome and DNA methylome will provide the capacity for further mechanistic studies of a common coastal coral in French Polynesia of great relevance for restoration and improve our capacity for comparative genomics in Acropora and cnidarians more broadly.
  
  This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.153). These reviews (including a protocol review) are as follows.
  
  Reviewer 1. Yanshuo Liang
  
  The manuscript by Conn et al. details the high-quality genome assembly of Acropora pulchra, an Acropora species of ecological and evolutionary significance, and also analyzes its genome-wide DNA methylation characteristics. These data complement the genetic resources of the Acropora genome. This manuscript is well written and represents a valuable contribution to the field. I have some comments below for the authors to address but look forward to seeing this research published.
  
  Q1: In the first sentence of the second paragraph of the Context: This is the first study to utilize PacBio long-read HiFi sequencing to generate a high quality genome with high BUSCO completeness, in tandem with its DNA methylome for scleractinian corals. Language such as "new", "first", "unprecedented", etc., should be avoided because it often leads to unproductive controversy. As far as I know, the genome you assembled is not the first stony coral to be sequenced using PacBio long-read HiFi sequencing. Back in 2024, He et al. assembled Pocillopora verrucosa (Scleractinia) to the chromosome level using PacBio HiFi long-read sequencing and Hi-C technology. I would suggest rephrasing this sentence. Reference: He CP, Han TY, Huang WL, et al. Deciphering omics atlases to aid stony corals in response to global change, 11 March 2024, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-4037544/v1].
  
  Q2: In this sentence: “On 23 October 2022, sperm samples were collected from the spawning of A.pulchra and preserved in Zymo DNA/RNA shield.” Please change “A.pulchra” to “A. pulchra”.
  
  Q3: Please change all “k-mer” into “k-mer” in the manuscript.
  
  Q4: Please change “Long-Tandem Repeats” to “Long Terminal Repeats”.
  
  Q5: In this sentence: “Funannotate train uses Trinity [18] and PASA [19] for ab initio predictions. Funannotate predict was then run to assign gene models using AUGUSTUS [20], GeneMark [21], and Evidence Modeler [19] to estimate final gene models.” Please state the versions of these software tools.
  
  Q6: References from [20] onward do not correspond well in the manuscript; please check.
  
  Reviewer 2. Jason Selwyn
  
  Is the language of sufficient quality? Yes. There are some minor grammatical issues throughout that warrant a closer reading to correct. E.g. Abstract: "...urgency to identify how genetic, epigenetic, and environmental...", "...management and and conservation...". Context: "...we aim to provide..." etc. Are all data available and do they match the descriptions in the paper? Yes. The link to the OSF repository in the PDF did not work. However, the link to the OSF repository from the github did work. Is the data acquisition clear, complete and methodologically sound? No. It isn't mentioned in the manuscript where the RNAseq data used to annotate the genome is from, nor any quality filtering steps that may have been applied to the RNA data prior to its use for annotation. Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes. Excluding the above comment about the RNA data. Additional Comments: This is a well assembled, and annotated genome that will contribute to the growing database of Acropora genomes. The manuscript could do with a simple pass to identify and correct some relatively minor grammatical issues and inconsistencies (Table 1 includes a thousands comma separator in some instances and not others) and needs to include details about the source of the RNA data used to train the ab initio gene predictors. There also appears to be a problem with the citation numbering after 20.
  
  **Reviewer 3. Benjamin Young** Are all data available and do they match the descriptions in the paper? Yes. Raw reads, metadata, and genome assembly are publicly available and have an NCBI project number in which they are all linked. Is the data acquisition clear, complete and methodologically sound? Yes. Collection of sperm samples, HMW DNA extraction, and SMRT Bell Library prep are written clearly. I have asked for a few clarifications on wording in this section in the attached edited pdf document. Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes. I think the pipeline used for de-novo genome generation (including raw read cleaning and assembly), repeat masking, and gene prediction and annotation is of high quality and best practices. With the inclusion of the GitHub and all analyses scripts, it is possible to reproduce the assembly generated. Is there sufficient data validation and statistical analyses of data quality? Yes. This is not super relevant for a genome assembly paper so I have no additional comments here. Is the validation suitable for this type of data? Yes. The authors use tools such as GenomeScope2 and BUSCO for validation of their data. It would be nice to see the tool they used to identify N50 and L50 (maybe Quast) included in the methods. Additionally, I would like to see a Merqury analysis of the HifiAsm primary and alternate assemblies to show that duplicate purging was successful. Additional Comments: I would first like to commend the authors for a well assembled genome resource for a coral species that will be greatly beneficial to the wider coral and scientific community. I have provided a PDF with comments throughout for the authors to address. The majority of these are easy fixes, including things such as sentence structure, inconsistent capitalisation of subheadings, additional references for methods, clarification of statements, and other suggestions. 
I do have a few larger requests for this to be published, and these are the reasons for selecting the major revision option as there may need to be figure updates, and quick additional analyses to be run. 1. Can you please correct the verbiage around BUSCO analysis throughout the manuscript. It is often stated "BUSCO completeness of xx%". BUSCO doesn't directly measure completeness, rather completeness of single copy orthologs against a specific database. I have left comments throughout on potential rewording for these instances. Please also specify the exact database you used (i.e. odb10_metazoa). Finally, can you please be more specific when stating BUSCO results, specifically when you use 96.9% this is single copy and duplicated complete BUSCOs. I have left comments in the pdf again for this. 2. In the results for Genome Assembly section can you please include results (i.e. length, N50, L50, number of contigs/scaffolds) for the primary assembly and the scaffolded assembly. 3. I think it would be not much work and provide additional information to show successful duplicate purging to run a Merqury analysis on the primary and alternative assemblies from HiFiAsm. 4. Can you include some additional information in the "Structural and Functional Annotation section". Specifically, can you provide information on the results from the funannotate predict step, and then how funannotate update improved this (if at all). 5. Please double check the methods section for funannotate. From reading the funannotate documentation I think there may be some confusion on what each step (train, predict, update, annotate) is doing. I have provided comments in the pdf to help clarify, and have also linked the funannotate documentation. 6. On NCBI I see that an additional Acropora pulchra genome has just been made available (29th Jan 2025), with this to the chromosome level (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_965118205.1/). 
I think it would be prudent to include this assembly's statistics in your Table 1, and also run a BUSCO analysis on this other assembly to compare with your one. While they got to chromosome level, you do have markedly fewer contigs. I do not think this is necessary for this manuscript, but in future work you could look to use their chromosome assembly to get your scaffolded assembly to chromosome level. Again, I want to say this is a wonderful resource for the coral and wider scientific community, and the pipeline for de-novo assembly and annotation is best practices in my opinion. Annotated additional file: https://gigabyte-review.rivervalleytechnologies.comdownload-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvNTk0L2Nvbm5ldGFsMjAyNV9yZXZpZXdjb21tZW50cy5wZGY=
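Reviewer 3 asks which tool produced the N50 and L50 values and requests them for both the primary and the scaffolded assembly. For readers unfamiliar with these statistics, the sketch below shows the usual definition: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly, and L50 is the number of sequences in that set. This is an illustrative Python sketch with hypothetical contig lengths, not the tool the authors used.

```python
from typing import Iterable, Tuple


def n50_l50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise AssertionError("unreachable: cumulative sum always reaches half the total")


# Hypothetical lengths (in bp) for a toy assembly of six contigs.
toy_contigs = [80, 70, 50, 40, 30, 20]
toy_n50, toy_l50 = n50_l50(toy_contigs)
```

Because the statistics depend only on the list of sequence lengths, the same function can be applied to contigs and to scaffolds to report both sets of values side by side.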
  
  Re-review:
  
  The authors have addressed all my comments and queries, and included nearly all recommendations. Thank you ! A few quick notes to fix before publication -
  
  "The input created Funannotate train uses Trinity v.2.15.2 [22] and PASA v.2.5.3 [23] for transcript assembly prior to ab initio predictions". This sentence reads oddly; please reword before publishing. I think maybe just remove "created Funannotate train" and then it reads correctly. Or "Funannotate train uses .....". - "PFAM v.37.0 [28], CAZyme [29], UniProtKB v[30] and GO [31]." Missing a few version numbers, and UniProt just has a v. - "The mitochondrial genome was successfully assembled and circularized using MitoHifi v3.2.2 The final assembled A. pulchra mitogenome is". Just missing a period, I think, before "The final assembled". Great job and a very useful resource for the coral community !!

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2025.03.27.645822v1
www.biorxiv.org

Healthy microbiome - moving towards functional interpretation

2
1. GigaScience 03 Apr 2025
  
  in GigaScience
  
  Abstract: Microbiome-based disease prediction has significant potential as an early, non-invasive marker of multiple health conditions linked to dysbiosis of the human gut microbiota, thanks in part to decreasing sequencing and analysis costs. Microbiome health indices and other computational tools currently proposed in the field often are based on a microbiome’s species richness and are completely reliant on taxonomic classification. A resurgent interest in a metabolism-centric, ecological approach has led to an increased understanding of microbiome metabolic and phenotypic complexity revealing substantial restrictions of taxonomy-reliant approaches. In this study, we introduce a new metagenomic health index developed as an answer to recent developments in microbiome definitions, in an effort to distinguish between healthy and unhealthy microbiomes, here in focus, inflammatory bowel disease (IBD). The novelty of our approach is a shift from a traditional Linnean phylogenetic classification towards a more holistic consideration of the metabolic functional potential underlining ecological interactions between species. Based on well-explored data cohorts, we compare our method and its performance with the most comprehensive indices to date, the taxonomy-based Gut Microbiome Health Index (GMHI), and the high dimensional principal component analysis (hiPCA) methods, as well as to the standard taxon-, and function-based Shannon entropy scoring. After demonstrating better performance on the initially targeted IBD cohorts, in comparison with other methods, we retrain our index on an additional 27 datasets obtained from different clinical conditions and validate our index’s ability to distinguish between healthy and disease states using a variety of complementary benchmarking approaches. Finally, we demonstrate its superiority over the GMHI and the hiPCA on a longitudinal COVID-19 cohort and highlight the distinct robustness of our method to sequencing depth. Overall, we emphasize the potential of this metagenomic approach and advocate a shift towards functional approaches in order to better understand and assess microbiome health as well as provide directions for future index enhancements. Our method, q2-predict-dysbiosis (Q2PD), is freely available (https://github.com/Kizielins/q2-predict-dysbiosis).
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf015), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Saritha Kodikara
  
  In this study, the authors present a novel metagenomic health index designed to differentiate between healthy and unhealthy microbiomes. This area of research is crucial for developing a non-invasive, cost-effective method to assess patient health status. However, I have several suggestions that I believe will enhance the study and address some key points.
  
  Main Comments:
  
  1.) The study would benefit from additional post-analysis to provide greater depth. Although the authors applied their approach to several diseases, they did not elaborate on the significance of individual microbiome features across different diseases. For instance, the GMHI parameters were identified as least important in IBD—does this observation hold universally across all diseases analysed?
  
  2.) The Q2PD index performed worse in AGP1 compared to HMP2 and AGP2. Is there a specific reason for this discrepancy? For example, does the index underperform in the heterogeneous functional landscape presented in AGP1 (Figure 2C)? An explanation for the reduced performance in this cohort would provide valuable insights into the method's performance under varying conditions.
  
  3.) It would be beneficial to make all processed data and relevant scripts available in a GitHub repository to ensure that the results presented in the paper can be replicated by other researchers.
  
  4.) When attempting to run the script available at https://github.com/Kizielins/q2-predict-dysbiosis, I encountered an error related to the scikit-learn version. The script appears to be compatible with version 1.2.2, whereas I was using version 1.4.2. Please consider updating the script or providing instructions for resolving version compatibility issues.
  
  5.) The rationale behind considering only positive correlations when calculating the index is unclear. It would be helpful to clarify why negative correlations were excluded from the index calculations.
  
  6.) In analysing longitudinal alterations, did the authors account for dependence on the Q2PD index at previous time points? If not, how do these longitudinal alterations differ from those observed in independent studies?
  
  7.) For each dataset analysed, additional details would be useful, such as the number of samples, species, functions, core functions, and the number of species remaining after applying the MDFS algorithm.
  
  8.) On Page 13, the authors state that they chose GMHI as their benchmark because hiPCA and Shannon entropy produced worse results for the HMP2 cohort. However, Supplementary Table 3 indicates that Shannon entropy had a lower p-value than GMHI in the Mann-Whitney U test.
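Comment 4 above reports a failure when the installed scikit-learn version (1.4.2) differed from the one the script appears to target (1.2.2). One hedged way to surface such a mismatch early is to compare the installed package version against the expected one before running the tool. The sketch below uses only the Python standard library; the helper names are hypothetical and are not part of q2-predict-dysbiosis, and the assumption that agreement on the major.minor release is sufficient is illustrative only.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def same_minor_release(installed: str, expected: str) -> bool:
    """True when the major.minor parts of two version strings agree."""
    return installed.split(".")[:2] == expected.split(".")[:2]


def check_environment(package: str = "scikit-learn", expected: str = "1.2.2") -> str:
    """Describe whether the installed package matches the expected release.

    The default values come from the reviewer's report; treating a matching
    major.minor release as compatible is an assumption of this sketch.
    """
    found = installed_version(package)
    if found is None:
        return f"{package} is not installed; expected {expected}"
    if not same_minor_release(found, expected):
        return f"{package} {found} is installed, but {expected} is expected"
    return f"{package} {found} matches the expected {expected} release series"
```

Calling `check_environment()` at the start of a run would give users a readable message instead of a low-level error, and pinning the expected version in the installation instructions would address the same issue at install time.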
  
  Minor comments:
  
  1) Page 11 Original: "Collecting information on feature importance at every iteration of the cross-validation procedure model, we consistently identified the two GMHI parameters as the least important (Figure 5b)." Suggested: "Collecting information on feature importance at every iteration of the cross-validation procedure model, we consistently identified the two GMHI parameters as the least important (Figure 4b??)."
  
  2) Page 12 Original: "Most importantly, Q2PD produced visually the highest scores for all healthy in comparison to unhealthy cohorts." Suggested: "Most importantly, Q2PD produced visually the highest median?? scores for all healthy in comparison to unhealthy cohorts."
  
  3) Page 12 Original: "Q2PD was also the only index to produce a statistically significant difference between Healthy and Obese in HMP2" Suggested: "Q2PD was also the only index to produce a statistically significant difference between Healthy and Obese in AGP2??"
  
  4) Page 14 Original: "The Q2PD important in all datasets that were included in its training and validation, specifically AGP_1, AGP_2 and HMP2 (Table 1, Supplementary Figure 7)." Suggested: "The Q2PD important in all datasets that were included in its training and validation, specifically AGP_1, AGP_2 and HMP2 (Table 1, Supplementary Figure 8??)."
2. GigaScience 03 Apr 2025
  
  in GigaScience
  
  
  Reviewer 1: Vanessa Marcelino
  
  The manuscript proposes a new method to distinguish between healthy and diseased human gut microbiomes. The topic is timely, as to date, there is no consensus on what constitutes a healthy microbiome. The key conceptual advance of this study is the integration of functional microbiome features to define health. Their new computational approach, q2-predict-dysbiosis (Q2PD), is open source and available on GitHub.
  
  While the manuscript is conceptually innovative and interesting for the scientific community, there are several major limitations in the current version of this study.
  
  To develop the Q2PD, they define features associated with health by comparing it with microbiome samples from IBD patients. There are many more non-healthy/dysbiotic phenotypes beyond IBD, therefore it is not accurate to use IBD as synonymous with dysbiosis as done throughout this version of the paper.
  
  The study initially tests the performance of Q2PD against other gut microbiome health indexes (GMHI and hiPCA) using the same data that was used to select the health-associated features of Q2PD. Model performance should be assessed on independent data. On a separate analysis, they do use different datasets (from GMHI and hiPCA), but these datasets seem to be incomplete - GMHI and hiPCA publications have included 10 or more disease categories, and it is unclear why only 4 categories are shown in this study.
  
  While Q2PD does provide visible improvements in differentiating some diseases from healthy phenotypes, the accuracy and sensitivity of Q2PD aren't clear. To adopt Q2PD, I would like to know what the chances are that the classification results will be correct.
  
  There is very little documentation on how to use Q2PD. What are the expected outputs? For example, do we need to choose a threshold to define health? Is the method completely dependent on Humann and Metaphlan outputs, or are other formats accepted? The test data contain some samples with zero counts. I got an error when trying it with the test data (ValueError: node array from the pickle has an incompatible dtype…).
  
  Therefore, I recommend including a range of disease categories to develop Q2PD and use independent datasets to validate the model in terms of accuracy and sensitivity. Alternatively, consider focusing this contribution on IBD. Making the code more user friendly will drastically increase the adoption of Q2PD by the community.
  
  Please also use page and line numbers when submitting the next version. Other suggestions:
  
  Abstract: I recommend replacing 'attributed' with 'linked', as 'attributed' suggests that dysbiosis may be causing (rather than reflecting) disease.
  
  Results: Please indicate what is meant by 'function' here - it will be good to clarify that this method uses Metaphlan's read-based approach to identify metabolic pathways. What is used, pathway completeness or abundance?
  
  Results regarding Figure 3a are difficult to interpret. Is 'non-negatively correlated' the same as 'positively correlated'? What does the colour gradient represent - their abundance in those groups, or the strength of their correlation?
  
  "We observed that the prevalence of the pairs positively correlated in health was higher than in a number of disease-associated groups (Figure 3b)" . This is a very generalised statement considering that only half of the comparisons were significant. How were co-occurring species selected?
  
  "To test this, we compared the contributions of MDFS-identified species to "core functions" in different groups (Supplementary Figure 4)." How was this comparison made, based on species correlations? The caption of these figures could include more detail - it just says 'Top species contributions to functions.' but how do you define 'top' ? What do the colours represent?
  
  'This finding was congruent with our earlier suspicions of functional plasticity; modulation of function and thus altered connectivity in the interaction network, shifting towards less abundant, non-core functions upon perturbation of homeostasis.' This is reasonable, but I don't understand how you can draw this conclusion from these figures where there seems to be no significant difference between health and disease.
  
  Section 'Testing q2-predict-dysbiosis, GMHI and hiPCA accuracy of prediction for healthy and IBD individuals'
  
  What is the difference between the fraction of "core functions" found and the fraction of "core functions" among all functions?
  
  "Most importantly, Q2PD produced visually the highest scores for all healthy in comparison to unhealthy cohorts" . This was not statistically significant. In fact, GMHI finds more significant differences between health and disease than Q2PD.
  
  Sup. Figure 7 - it would be informative to add the name/description of these metabolites (not just their ID).
  
  'Although the threshold of 0.6 as determinant of health by the Q2PD was not applicable to the new datasets'. Does the threshold to define health with Q2PD change depending on the dataset? What are the implications of this for the applicability of this index?
  
  Effects of sequencing depth - this is a very good addition to the paper, the effects of sequencing depth can be profound but are ignored in most studies, so I commend the authors for doing this here. It would be even better, in my opinion, if this was done with the same datasets used to test/compare Q2PD with other methods, as using a different dataset here adds a new layer of confounding factors.
  
  'the GMHI and the hiPCA produced the opposite trend, wrongly indicating patient recovery.' The difference here is striking; what is driving this trend?
  
  The Gut Microbiome Wellness Index 2 (GMWI2) is now published. I don't think it needs to be part of the benchmarking, but it could be acknowledged/cited here.
  
  Methods: More information on how the data were processed is needed - how were the abundance tables normalized? Which output from HUMAnN was used for downstream analyses?
  
  To ensure reproducibility, please provide the scripts/code used for analyses and figures.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2023.12.04.569909v6
www.biorxiv.org

Genomic and transcriptomic analyses of Heteropoda venatoria reveal the expansion of P450 family for starvation resistance in spider

2
1. GigaScience 03 Apr 2025
  
  in GigaScience
  
  Abstract. Background: Spiders generally exhibit robust starvation resistance, with hunting spiders, represented by Heteropoda venatoria, being particularly outstanding in this regard. Given the challenges posed by climate change and habitat fragmentation, understanding how spiders adjust their physiology and behavior to adapt to the uncertainty of food resources is crucial for predicting ecosystem responses and adaptability. Results: We sequenced the genome of H. venatoria and, through comparative genomic analysis, discovered significant expansions in gene families related to lipid metabolism, such as cytochrome P450 and steroid hormone biosynthesis genes. We also systematically analyzed the gene expression characteristics of H. venatoria at different starvation resistance stages and found that the fat body plays a crucial role during starvation in spiders. This study indicates that during the early stages of starvation, H. venatoria relies on glucose metabolism to meet its energy demands. In the middle stage, gene expression stabilizes, whereas in the late stage of starvation, pathways for fatty acid metabolism and protein degradation are significantly activated, and autophagy is increased, serving as a survival strategy under extreme starvation. Additionally, analysis of expanded P450 gene families revealed that H. venatoria has many duplicated CYP3 clan genes that are highly expressed in the fat body, which may help maintain a low-energy metabolic state, allowing H. venatoria to endure longer periods of starvation. We also observed that the motifs of P450 families in H. venatoria are less conserved than those in insects, which may be related to the greater polymorphism of spider genomes. Conclusions: This research not only provides important genetic and transcriptomic evidence for understanding the starvation mechanisms of spiders but also offers new insights into the adaptive evolution of arthropods.
  
  This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf019), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
  
  Reviewer 2: Sandra Correa-Garhwal
  
  The manuscript "Genomic and transcriptomic analyses of Heteropoda venatoria reveal the expansion of P450 family for starvation resistance in spider" uses comparative genomics to study the underlying mechanisms of starvation resistance. I appreciate that the authors have produced a high-quality genome for an RTA species. The methods are sound and some interesting gene families are highlighted as key factors in starvation resistance.
  
  One primary concern I have relates to the study's setup and hypothesis. As currently written, the study comes across as a fishing expedition rather than a focused research project. Although the introduction is informative, it lacks a clear rationale for including this particular species. The reasoning only becomes apparent at the end of the gene family expansion and contraction section. Additionally, I am unsure if being an active hunter makes feeding more unpredictable compared to web-based prey capture. I recommend incorporating this information into the introductory paragraph to better establish the context for the analysis. While terms like "autophagy" and "energy homeostasis" are appropriate for a scientific audience, consider briefly defining them for clarity, especially if the intended audience might not be familiar with all the terminology. Although the authors mention that there is no high-quality genome sequence for H. venatoria, it could be helpful to elaborate on why this is significant for understanding starvation resistance. A brief explanation of how genomic data could enhance understanding of the molecular mechanisms involved would strengthen this point. The conclusion provides a clear goal for your study, but it could be more impactful. You might want to emphasize the broader implications of your research findings for ecological conservation and biodiversity. End with a statement about the importance of understanding these mechanisms in the context of preserving ecosystems and addressing challenges posed by climate change.
  
  For the discussion, while the content is detailed, some parts feel slightly repetitive or could be more concise. For instance, the description of P450 gene expression could be streamlined by removing redundant mentions of their role in metabolic rate regulation. Example: In the discussion section "Interestingly, we found that some P450 families are expanded in H. venatoria, and most P450 genes are more highly expressed in the fat body than in other tissues…" This point is later reiterated in the sentence about other spider species. These ideas could be combined for efficiency. The paragraph about the phylogenetic analysis of the CYP3 clan could be shortened. While it is an interesting finding, some of the details (like the number of genes or proteins) might be better suited for the main text rather than a summary. Focusing more on the functional implications of these duplications would keep the reader engaged. Though the findings are well-explained, the broader significance could be emphasized more explicitly. For example, why is understanding these mechanisms important for the field of arachnid biology, evolutionary biology, or even practical applications (e.g., pest control, conservation)? You could add a closing sentence that ties everything together and highlights the broader relevance of the findings, such as the evolutionary or ecological importance of these adaptations in spiders.
  
  Other comments: Last paragraph of the introduction: When introducing Heteropoda venatoria, please spell out the species name the first time it is used. The sentence "However, these findings indicate that H. venatoria does not feed in a stable manner and often experiences periods of starvation." does not fit the rest of the text. Findings from what study? "Transcriptome design for starvation resistance in H. venatoria" section: The first sentence is a confusing way to start: what samples? Please add information about the samples. You could delete "the samples of H. venatoria were subjected to"; it will read better. Are all 23 CYP3 clan genes on chromosome 4 tandemly arrayed? Figure 4 - add more information about the figure. For panel C, what do the red lines show? The grey lines? The numbers in the circles? While I know what they represent, other readers might not. The finding that H. venatoria chromosomes have undergone a lot of chromosomal fragmentation is very interesting, and it is clearly shown in the figure, which is why I think more detail is needed. In the sentence "In Uloborus diversus, members of this subfamily are located on Chr5 and an unanchored scaffold.", you need to specify which members. Figure 5 - Include a description of the tissues. What is Epi? Ducts? Tail?
2. GigaScience 03 Apr 2025
  
  in GigaScience
  
  Reviewer 1: Hui Xiang
  
  In this study, the authors deciphered the chromosome-level genome of an RTA spider, Heteropoda venatoria, which has a large body size, and generated comprehensive comparative transcriptomes of the fat body and whole body under CK and starvation conditions. Generally, this study adds important genomic and transcriptomic data on spiders and provides some clues for understanding the molecular changes during starvation. However, the organization of the manuscript is quite problematic.
  1. As to the Results section, please be concise and highlight the main results, avoiding an accumulation of complex results. Do not present too many statements that belong in the Introduction or Discussion, and do not raise too many hypotheses in the Results.
  2. As for the involvement of the Hippo signaling pathway in lipid metabolism regulation, the cited literature and the genes mentioned are not related to the results of this study. As for the P450 analysis, the descriptions of the structural analysis are quite complex and difficult to understand. The authors did not clearly explain the relationship between the expansion of P450 genes and starvation resistance in the results of this study.
  3. The authors' analysis of the DEG enrichment results in the transcriptome analysis is confusing. Firstly, I cannot agree with the authors that "During the early stage of starvation (from CK to 2 W), many genes, specifically those involved in oxidative phosphorylation and thermogenesis pathways, were up-regulated (Fig. 2E). These findings indicate that during the early starvation stage, energy metabolism in H. venatoria occurs regularly, with sufficient supply of energy." There is a batch of DEGs between 2 W and CK, and many of the enriched pathways are neurodegeneration-related. How can these changes be explained? Secondly, for 4 W to 8 W, I cannot understand how the down-regulation of the Hippo signaling pathway relates to the authors' speculation that "H. venatoria may reduce its cellular glucose uptake and utilization to adjust to the food-scarce environment.", as this pathway is involved in lipid metabolism, as the authors stated. Thirdly, from 14 W to 19 W, pathways such as lysosome and apoptosis were down-regulated instead of up-regulated. So why do the authors think autophagy became more active?
  4. "We speculate that during the evolution of spider genomes, two types of repeat sequences, TcMar and LTR sequences, had a significant impact on the size of spider genomes. Interestingly, we found that in H. venatoria chromosomes, regions with a high proportion of repeats also presented an increase in GC content (Fig. 1B)" The authors' conclusion that high-repeat regions have higher GC content is based on Fig. 1B alone, which is too arbitrary. They need more solid evidence and a more detailed analysis. For example, the GC content of TE regions could be compared with that of the whole genome and with that of gene regions. The significance of the relevant results should be explained. In addition, the authors should discuss this result more convincingly, drawing on more of the literature.
  5. "We gathered genomic data and annotations for one scorpion and seven chromosome-level spider genomes using the scorpion as an outgroup [35-42]". Many spider genomes have been published at the chromosome level. What criteria did the authors use to select the spider genomes in this study?
  6. "Transcriptome design for starvation resistance in H. venatoria" in the Results should be partially moved to the Methods, and here the authors should straightforwardly highlight the results.
  7. I can't understand the significance of Fig. 2C. The authors did not explain it in the manuscript either.
  8. "The PCA results from both the fat body and whole-body transcriptomes indicated that H. venatoria transcriptome at 19 weeks of starvation was markedly distinct from that at other stages (Fig. 2A, B). Consequently, we conducted a differential analysis of the transcriptome at 19 weeks." Please clarify how the transcriptome comparisons for the differential analysis were conducted.
  9. The language should be polished.

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.07.31.605936v1
Mar 2025
www.biorxiv.org

CompactTree: A lightweight header-only C++ library for ultra-large phylogenetics

1
1. GigaScience 23 Mar 2025
  
  in GigaByte
  
  Editors Assessment:
  
  As volumes of viral and bacterial sequence data grow exponentially, the field of computational phylogenetics now demands resources to manage the burgeoning scale of this input data. This study introduces CompactTree, a C++ library designed for ultra-large phylogenetic trees with millions of tips. To address these scalability issues while maintaining ease of incorporation into external code bases, CompactTree is a header-only library with enhanced performance utilizing minimal dependencies, optimized node representation, and memory-efficient tree structure schemes, resulting in significantly reduced memory footprints and improved processing times. Peer review requested some more detail on the functionality and some real-world examples, demonstrating the current utility of the tool. Although the tool primarily supports the (text-based) Newick format, its increased scalability and extensibility hold promise for multiple biological and epidemiological applications supporting more complex formats such as Nexus and NeXML. The tool is open source (GPLv3 licensed) and available on GitHub: https://niema.net/CompactTree
  
  This evaluation refers to version 1 of the preprint
  
  Summary

Tags

Summary

Annotators

GigaScience

URL

biorxiv.org/content/10.1101/2024.07.15.603593v1

GigaScience

Annotations: 1,071

Joined: September 13, 2019

Definition and calculation of endogenous DNA fraction

Need a better explanation of the "month of collection" variable

Need a clarification of "Collection Climate" vs. Herbarium Storage

Need for the integration of non-deamination mismatch controls and baseline divergence

Message sent by the authors

Strengths

Weaknesses

Minor Issues
