Reviewer #3 (Public Review):
Summary:
Buck et al. set out to characterize small DNA tumor viruses through the assembly and analysis of ~100,000 public sequencing datasets from the SRA and other databases. Using a variety of powerful bioinformatic methods, including alignment-based searches, statistical modelling, and structure-aware detection, the authors classify novel protein sequences that support the occurrence of evolutionary gene transfer between DNA virus families. The authors propose a naming scheme to better capture viral diversity and uncover novel chimeric viruses, i.e., those containing genes from multiple established virus families. The generated dataset was additionally searched for DNA and RNA viruses of interest, demonstrating its utility for exploratory screens. The assembled sequencing datasets are publicly available, providing an invaluable resource for current and future investigations within this subfield.
Strengths:
The scope of the data analysis (100,000+ SRA records and additional libraries) is substantial, and the authors provide further insight into the modularity of previously uncharacterized viral genomes through computationally demanding bioinformatic analyses combined with extensive manual inspection.
The publicly available resources generated as a result of these analyses provide useful data for further experiments to inspect viral diversity and modularity. Other scanning experiments and further investigation of biologically relevant viruses using these contigs may uncover, for example, animal reservoirs or novel recombinant viruses of significance.
Novel instances of genomic modularity provide excellent starting points for understanding virus evolutionary pathways and gene transfer events.
Weaknesses:
Overall, the methods section of this paper requires more detail.
The inclusion criteria determining which SRA datasets were or were not used in this study are poorly defined. As a result, the comprehensiveness of the study for a given search space of the SRA is unknown, and the results are ultimately neither reproducible nor extensible. For example, were all vertebrate RNA-seq samples processed, or just aquatic vertebrate RNA-seq? Were samples randomly drawn from a more comprehensive dataset? What is the make-up of the search space, and how much of it was DNA-seq versus RNA-seq? This section should be expanded, and an explicit accounting provided of how dataset selection was performed. This would provide additional confidence in the results and conclusions, and would allow future analyses to be conducted.
The hallmark virus genes require further clarification, as it is unclear which genes are used as bait or in the initial search process. The reported "Hallmark gene sets" are not described in a systematic way. What is the sensitivity and specificity of these gene sets? Were the performance characteristics (ROC) of this gene set validated with different tools? How is the gene set expected to be used? Which kinds of viruses are excluded or missed? Are viroids included?
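To make the requested validation concrete, the following is a minimal sketch of how sensitivity, specificity, and ROC points could be computed for a hallmark gene search. It is not the authors' method: the function names, the toy scores, and the labels are all hypothetical, and it assumes a benchmark set in which each sequence is labelled as a known family member or a decoy and has a search score (for example, a profile bit score).

```python
def sensitivity_specificity(scores, labels, threshold):
    """Call a hit when score >= threshold.

    labels[i] is True for a known family member, False for a decoy.
    Returns (sensitivity, specificity).
    """
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) at each distinct
    score threshold, from the strictest threshold to the most permissive."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(scores, labels, threshold)
        points.append((1.0 - spec, sens))
    return points


# Toy benchmark (hypothetical values, for illustration only).
scores = [90.0, 75.0, 60.0, 40.0, 30.0, 10.0]
labels = [True, True, False, True, False, False]

sens, spec = sensitivity_specificity(scores, labels, threshold=50.0)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f} TPR={tpr:.2f}")
```

Reporting such a curve for each search tool would let readers judge which viruses are likely to be missed at the thresholds actually used.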
For the Tailtomavirus, additional information is needed for sufficient confidence. Was this "chimeric" genomic arrangement detected in a single library? This raises the broader issue of how technical artifacts, which may appear as chimeric assemblies, are ruled out in the workflow. If two viral genomes share a sequence longer than the assembly k-mer size, their assembly graphs may become merged. Are there read pairs that span all regions of the genome? Is there evidence for multiple homologous viruses with synteny between them, which would support the combination of these genes as an evolving genome, or is this an anomalous observation? For all cases of chimeric assemblies, read alignments and Bandage graph visualizations should be included, together with active steps to disprove the baseline hypothesis that these are technical artifacts of genome assembly.
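The assembly-graph concern can be illustrated with a minimal sketch, which is not the authors' workflow. It lists the canonical k-mers shared by two sequences at a given assembly k; any shared k-mer is a point at which a de Bruijn graph assembler could join two otherwise unrelated genomes. The function names and toy sequences are hypothetical, and the sketch assumes unambiguous A/C/G/T input.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq, k):
    """Set of canonical k-mers: each k-mer is represented by the
    lexicographically smaller of itself and its reverse complement,
    so both strands are treated as equivalent."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def shared_kmers(seq_a, seq_b, k):
    """Canonical k-mers present in both sequences."""
    return canonical_kmers(seq_a, k) & canonical_kmers(seq_b, k)


# Toy genomes (hypothetical) that share one 13-nt stretch.
shared_segment = "GGATCCGATTACA"
genome_a = "ATGCGTACGTTAGC" + shared_segment + "TTGACCA"
genome_b = "CCCTAAAGT" + shared_segment + "AGGCTTT"

k = 11  # assumed assembly k-mer size, shorter than the shared stretch
common = shared_kmers(genome_a, genome_b, k)
print(f"{len(common)} canonical {k}-mers are shared between the two toy genomes")
```

Running such a check on the candidate parental lineages of each chimeric contig, at the k values used by the assembler, would show whether a graph merge is a plausible explanation before read-pair evidence is examined.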
Justification for the exclusion of endogenized sequences is not provided and must be described, as small DNA tumor viruses may integrate into the host genome as part of their life cycle. How is such an integration distinguished from an evolutionary "endogenization"? What is the biological justification for this step?
Additional supporting information, clear presentation, and context are needed to strengthen results and conclusions.
Basic reporting of global statistics, such as the total number of viruses found per family, should be included in the main text to better support the scope of the results. How many viruses (per family) were previously known, and therefore what is the magnitude of the expansion performed here?
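The requested summary amounts to a simple per-family table. The sketch below shows one possible layout, assuming counts of previously known and newly reported viruses are available for each family. The family names and counts are placeholders, not values from the manuscript, and the function name is hypothetical.

```python
def expansion_table(known, novel):
    """Return rows of (family, previously known, newly reported, fold expansion).

    Fold expansion is (known + novel) / known, and is None when no
    members of the family were previously known.
    """
    rows = []
    for family in sorted(set(known) | set(novel)):
        n_known = known.get(family, 0)
        n_novel = novel.get(family, 0)
        fold = (n_known + n_novel) / n_known if n_known else None
        rows.append((family, n_known, n_novel, fold))
    return rows


# Placeholder counts for illustration only.
known = {"family_A": 10, "family_B": 0}
novel = {"family_A": 5, "family_B": 3}

for family, n_known, n_novel, fold in expansion_table(known, novel):
    fold_text = f"{fold:.1f}x" if fold is not None else "n/a"
    print(f"{family}\tknown={n_known}\tnovel={n_novel}\texpansion={fold_text}")
```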
Bioinformatic tool outputs should be reported with additional parameters and information so that the results can be interpreted clearly. For example, reporting the "BLASTp E-val", as for the PolB homology (BLASTp 6E-12), is not informative on its own and does not tell the reader that this is (we assume) an expect value. For each such case, please report the top database hit accession, percent identity, query coverage, and E-value. Otherwise, the quality of the evidence for homology cannot be adequately judged. Similarly, for HHpred, what does the number represent: confidence, identity, or coverage?
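As an illustration of the requested reporting, the sketch below extracts the top hit per query from BLAST tabular output and derives the four requested fields. It is a sketch under stated assumptions, not the authors' pipeline: the column order is assumed to come from a custom tabular format such as `-outfmt "6 qseqid sacc pident length qlen qstart qend evalue bitscore"`, the accessions and values in the example are placeholders, and query coverage is computed from a single alignment only.

```python
import csv
import io

# Assumed column order of the tabular BLAST output (see lead-in).
COLUMNS = ["qseqid", "sacc", "pident", "length", "qlen",
           "qstart", "qend", "evalue", "bitscore"]


def top_hits(tsv_text):
    """Return {query id: best hit}, where the best hit is the row with the
    highest bit score and carries accession, percent identity,
    query coverage (%), E-value, and bit score."""
    best = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        rec = dict(zip(COLUMNS, row))
        aligned = int(rec["qend"]) - int(rec["qstart"]) + 1
        hit = {
            "accession": rec["sacc"],
            "percent_identity": float(rec["pident"]),
            "query_coverage": 100.0 * aligned / int(rec["qlen"]),
            "evalue": float(rec["evalue"]),
            "bitscore": float(rec["bitscore"]),
        }
        query = rec["qseqid"]
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return best


# Placeholder rows (not real database hits).
example = (
    "query1\tPLACEHOLDER_ACC_1\t31.5\t200\t400\t51\t250\t6e-12\t70.1\n"
    "query1\tPLACEHOLDER_ACC_2\t28.0\t150\t400\t60\t209\t1e-5\t45.0\n"
)

for query, hit in top_hits(example).items():
    print(query, hit["accession"],
          f"identity={hit['percent_identity']:.1f}%",
          f"coverage={hit['query_coverage']:.1f}%",
          f"E={hit['evalue']:.0e}")
```

A supplementary table in this form for every homology claim in the text would let readers assess each one directly.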
Some findings described in the Results section may require revision. Several of the Nidoviruses (Nidovirus takifugu, Nidovirus hypomesus, Nidovirus ambystoma, etc.) have been previously described by three groups: first by Edgar et al. (https://www.nature.com/articles/s41586-021-04332-2), then by Miller et al. (https://academic.oup.com/ve/article/7/2/veab050/6290018), and then by Lauber et al. (https://journals.plos.org/plospathogens/article?id=10.1371/journal.ppat.1012163). This is now the fourth description of the same set of viruses. These sequences are in GenBank (https://www.ncbi.nlm.nih.gov/nuccore/OV442424.1), although it is unclear why they are not returned as BLAST hits. Miller et al. also described the Togavirus co-segment previously.
It is also unclear what is newly described for the HelPol/maldviruses that had not already been described in distantly similar relatives. How many were described in the previous literature, and how many are described by this work?
Co-phylogenies should be used to convey gene transfer and flow clearly, and thereby support the conclusions made in the text.
Statements such as, "The group encompasses a surprising degree of genomic diversity...", should be supported by additional information to strengthen conclusions (e.g., what the expected diversity is). What is the measurement for genomic diversity here, and why is this surprising? There is overall a lack of quantification to support the conclusions made throughout the paper.