764 Matching Annotations
  1. Jan 2022
    1. Now published in GigaScience doi: 10.1186/s13742-016-0111-z

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1186/s13742-016-0111-z ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.100374 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.100375

    1. Findings

      Reviewer 3. Tom Madden; Minor Essential Revisions: The descriptions for TBLASTN and TBLASTX need to be swapped in Table 1 (rows 4 and 5).

      Discretionary Revisions This is a well structured and informative article about supporting sequence similarity searching (BLAST) in Galaxy. The Results section contains a number of example applications such as "Assessing a de novo assembly". This section does have references to some of the tools that would be useful in this task (e.g., seq_filter_by_id), but I find these examples somewhat abstract. Additionally, a reader unfamiliar with Galaxy might not be convinced of the advantages of using Galaxy for these tasks as opposed to simply running the searches themselves. I'd suggest that the authors add a concrete example of one of their use-cases.

      For example, the authors could show how their tools could find the globin cluster for some mammal. This should include all accessions and other information so that a reader could reproduce the result. This example could be added as supplementary material if need be.

    2. Background

      Reviewer 2. Gianmauro Cuccuru

      The authors present their effort to integrate the command line NCBI BLAST+ tool suite into the Galaxy platform, providing a full set of wrappers, BLAST related tools and datatype definitions.

      The manuscript is clearly written and includes several useful use-cases and workflows combining the tools within Galaxy. The tools are not available through a public server for testing, but the authors crafted an excellent virtual machine providing a complete Galaxy server with the BLAST+ tools preinstalled. In my opinion the work represents a valuable addition to the software resources available to the Galaxy community, hence my recommendation for its publication.

    3. Abstract

      Reviewer 1. Stian Soiland-Reyes

      This review is also available at

      Description

      The article describes a mechanism to add the BLAST+ functionality to the Galaxy workflow system. This is a very useful feature, so in principle I would want to see this article published. I do, however, have some concerns with the aspects of reproducibility and documentation, which are detailed below.

      Major Compulsory Revisions

      I am afraid I will have to ask for major compulsory revisions, as I was unable to reproduce any of the claims of the paper.

      1: Docker image is not BLAST enabled

      p5.

      The article says the command docker ... will "start a BLAST enabled Galaxy instance". I tried the Docker image. It starts up fine and presents a Galaxy that includes a list of BLAST tools, so the BLAST tools have been installed. The Docker instance is, however, not BLAST enabled, as the BLAST tools require further configuration and download of the external BLAST reference databases to align against. This procedure is loosely documented at https://registry.hub.docker.com/u/bgruening/galaxy-blast/ , but I was unable to follow through with this installation: it was quite complicated and seems to require manual downloading and configuration of many GB of reference data spread over more than 300 files. I was assuming that a Docker image would be usable out of the box, but this is far from the truth in this case. Accessing "NCBI BLAST+ database info" gives an empty dropdown list.

      The article mentions that the public Galaxy instance usegalaxy.org does not provide the BLAST tools by default due to concerns over computational load, but I am also worried it could be because configuring the BLAST+ tools is quite a complicated job. The article does not mention at all the excessive amount of system administration that is required in order to finalize the BLAST installation, and the Docker image does not provide any helper scripts to assist with this. In fact, the example database configuration files use a totally different path, e.g. /depot/data2/galaxy/blastdb/nt/nt.chunk, while the Docker image would require these under /data/nt/nt.chunk. Neither the article nor the Docker README mentions which subset of the databases would commonly need to be downloaded, or even the fact that all of the numbered fragments need to be downloaded. The datasets referenced from the example configuration, e.g. nt.chunk and wgs.chunk, do not exist on ftp://ftp.ncbi.nlm.nih.gov/blast/db/ ; only non-chunk versions exist.
      I tried to download a subset of the datasets from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ :

      ```
      stain@biggie-utopic:/galaxy_store/data/blast_databases$ ls
      human_genomic.00.nhd  nt.00.nhd  refseq_genomic.148.nhr     refseq_protein.00.pin  refseq_protein.15.pnd     wgs.00.nhi
      human_genomic.00.nhi  nt.00.nhi  refseq_genomic.148.nin     refseq_protein.00.pnd  refseq_protein.15.pni     wgs.00.nhr
      human_genomic.00.nhr  nt.00.nhr  refseq_genomic.148.nnd     refseq_protein.00.pni  refseq_protein.15.pog     wgs.00.nin
      human_genomic.00.nin  nt.00.nin  refseq_genomic.148.nni     refseq_protein.00.pog  refseq_protein.15.ppd     wgs.00.nnd
      human_genomic.00.nnd  nt.00.nnd  refseq_genomic.148.nog     refseq_protein.00.ppd  refseq_protein.15.ppi     wgs.00.nni
      human_genomic.00.nni  nt.00.nni  refseq_genomic.148.nsd     refseq_protein.00.ppi  refseq_protein.15.psd     wgs.00.nog
      human_genomic.00.nog  nt.00.nog  refseq_genomic.148.nsi     refseq_protein.00.psd  refseq_protein.15.psi     wgs.00.nsd
      human_genomic.00.nsd  nt.00.nsd  refseq_genomic.148.nsq     refseq_protein.00.psi  refseq_protein.15.psq     wgs.00.nsi
      human_genomic.00.nsi  nt.00.nsi  refseq_genomic.148.tar.gz  refseq_protein.00.psq  refseq_protein.15.tar.gz  wgs.00.nsq
      human_genomic.00.nsq  nt.00.nsq  refseq_genomic.nal         refseq_protein.15.phr  refseq_protein.pal        wgs.nal
      human_genomic.nal     nt.nal     refseq_protein.00.phr      refseq_protein.15.pin  wgs.00.nhd
      ```

      and configured these in blastdb.loc according to the Docker readme. The readme says you need to add the paths to your BLAST databases, and that they need to look like /export/swissprot/swissprot, but I had followed the instructions three lines above, which mounted the datasets at /data; hence I used /data/ instead of /export. Some consistency would help here.
      ```
      stain@biggie-utopic:/galaxy_store/data/blast_databases$ grep -v ^# /tmp/galaxy/galaxy-central/tool-data/blastdb*loc
      /tmp/galaxy/galaxy-central/tool-data/blastdb.loc:nt_02_Dec_2009 nt 02 Dec 2009 /data/nt
      /tmp/galaxy/galaxy-central/tool-data/blastdb.loc:wgs_30_Nov_2009 wgs 30 Nov 2009 /data/wgs/wgs
      /tmp/galaxy/galaxy-central/tool-data/blastdb.loc:refseq_genomic_148 refseq 148 /data/refseq_genomic
      /tmp/galaxy/galaxy-central/tool-data/blastdb.loc:
      /tmp/galaxy/galaxy-central/tool-data/blastdb.loc:
      /tmp/galaxy/galaxy-central/tool-data/blastdb_p.loc:nt_02_Dec_2009 nt 02 Dec 2009 /data/nt
      /tmp/galaxy/galaxy-central/tool-data/blastdb_p.loc:wgs_30_Nov_2009 wgs 30 Nov 2009 /data/wgs/wgs
      /tmp/galaxy/galaxy-central/tool-data/blastdb_p.loc:refseq_protein refseq protein /data/refseq_genomic
      /tmp/galaxy/galaxy-central/tool-data/blastdb_p.loc:
      /tmp/galaxy/galaxy-central/tool-data/blastdb_p.loc:
      ```

      A BLAST Data Manager is available at https://github.com/peterjc/galaxy_blast/ ; in theory this can download and populate the blastdb data table. This is mentioned as "Future work" in the article, so presumably it is not yet production ready. This Data Manager does not appear under Data Libraries in the Docker image, and it is not included in the installation at https://registry.hub.docker.com/u/bgruening/galaxy-blast/dockerfile/
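      The /depot vs /data vs /export path confusion could be reduced by generating the .loc file from whatever databases are actually present on disk, rather than hand-editing it. A minimal sketch of that idea (this helper is my own illustration, not part of galaxy_blast; it only recognises multi-volume nucleotide databases by their *.nal alias file, and the .loc format shown is the three-column tab-separated layout used in the examples above):

      ```python
      from pathlib import Path

      def generate_blastdb_loc(db_root, out_path):
          """Scan db_root for BLAST nucleotide databases and write a
          Galaxy blastdb.loc file (tab-separated: id, name, path).

          A multi-volume database is recognised by its alias file
          (e.g. nt.nal); the path Galaxy needs is that file's base
          name without the extension (e.g. /data/nt/nt)."""
          entries = []
          for alias in sorted(Path(db_root).rglob("*.nal")):
              name = alias.stem                 # e.g. "nt"
              base = alias.with_suffix("")      # path without ".nal"
              entries.append((name, name, str(base)))
          with open(out_path, "w") as out:
              out.write("# Generated blastdb.loc\n")
              for db_id, name, path in entries:
                  out.write(f"{db_id}\t{name}\t{path}\n")
          return entries
      ```

      Running such a script against the actual mount point (/data in the Docker image) would keep the .loc entries consistent with the filesystem by construction.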

      2: Galaxy Tool Shed not working with the docker image

      The article says: "The recently published Galaxy Tool Shed [9] allows anyone hosting a Galaxy instance to install tools and defined dependencies with a few clicks right from the Galaxy web application itself." I am unable to verify this claim using the provided Docker image: I cannot install any tools from the Galaxy Tool Shed from its web interface. I am logged in as admin@galaxy.org according to the instructions, but if I go to Admin -> Search and browse tool sheds ( http://localhost:8080/admin_toolshed/browse_tool_sheds ) and click the dropdown list for "Browse valid sheds", it hangs for a while before failing with "Can't find the server". On the console I get many error messages like:

      ```
      URLError: <urlopen error [Errno -2] Name or service not known>
      tool_shed.util.shed_util_common ERROR 2015-01-19 10:22:08,698 Error attempting to get tool shed status for installed repository ncbi_blast_plus:
      <urlopen error [Errno -2] Name or service not known>
      Traceback (most recent call last):
        File "lib/tool_shed/util/shed_util_common.py", line 772, in get_tool_shed_status_for_installed_repository
          encoded_tool_shed_status_dict = common_util.tool_shed_get( app, tool_shed_url, url )
        File "lib/tool_shed/util/common_util.py", line 345, in tool_shed_get
          response = urlopener.open( uri )
        File "/usr/lib/python2.7/urllib2.py", line 404, in open
          response = self._open(req, data)
        File "/usr/lib/python2.7/urllib2.py", line 422, in _open
          '_open', req)
        File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
          result = func(*args)
        File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
          return self.do_open(httplib.HTTPConnection, req)
        File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
          raise URLError(err)
      ```

      Inspecting the internal frame, I see the link is http://toolshed.g2.bx.psu.edu/repository/browse_valid_categories?galaxy_url=http://localhost:8080 . I suspect toolshed.g2.bx.psu.edu will try to connect to my Galaxy instance at http://localhost:8080, which is not going to work. Then again, the actual Tool Shed site is currently unavailable ( http://www.downforeveryoneorjustme.com/toolshed.g2.bx.psu.edu reports: "It's not just you! http://toolshed.g2.bx.psu.edu looks down from here."), so this might be a temporary network problem unrelated to the "localhost" bit. I am nevertheless unable to verify the claim of the ease of using the Tool Shed to install the BLAST+ tools because of this. If it is true that the Docker image does not work with the Galaxy Tool Shed, which now hosts most of the tools required in a Galaxy installation, then this should be duly noted in the article and in the README of the Docker image.

      3: Supporting data has no usage instructions

      The article links to https://github.com/peterjc/galaxy_blast as the supporting data, but this website has no instructions on how to install or use it with Galaxy or the Galaxy Docker image. I could execute .travis.yml "by hand", but I do not feel this is sufficient documentation for a supporting data set. I have therefore not been able to verify that the supporting data actually supports the article, beyond inspecting the Travis-CI build logs at https://travis-ci.org/peterjc/galaxy_blast/builds , which seem to verify the tools, except for a single error in https://travis-ci.org/peterjc/galaxy_blast/builds/45137901 : "OperationalError: (OperationalError) unable to open database file None None".

      Minor Essential Revisions

      4: Provenance and update issue not addressed

      Workflow systems are commonly praised in bioinformatics because they enable reproducibility and sharing of analytical pipelines. One challenge in this respect is that bioinformatics software tools and reference datasets are updated frequently. In fact, BLAST+ 2.2.30 was released just 6 weeks ago [ ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/ ] and the latest BLAST reference dataset taxdb.tar.gz was updated today [ ftp://ftp.ncbi.nlm.nih.gov/blast/db/ ]. The BLAST FTP site does not seem to attach any version number to the datasets, and as datasets are split over multiple files, it would be difficult to know whether you have downloaded half an old dataset and half a new one (a new version of the dataset could be released in the middle of your lengthy download). The sanity of the dataset could potentially be verified by downloading the *.md5 files both before and after the large download, but this should be automated by a script to be done correctly. Assuming a successful workflow run using the described Galaxy BLAST tools, I would say the main challenges are:

      a) Which version of the BLAST tool was used?
      b) Which reference data set was used?
      c) Which version of it?
      d) Was the install complete and sane? (ref. MD5 files and updates)
      e) Are there any later versions of the tool or reference data set? How do I keep my Galaxy instance up to date? (By going through the lengthy database download and configuration again?)
      f) How can a Galaxy workflow using BLAST+ be shared with another Galaxy instance (supposedly easily started with Docker), when manual download and configuration of databases is required?

      Your article does not mention how the BLAST+ tools for Galaxy address any of these concerns. The use of the Galaxy Tool Shed should in theory allow for automatic updating of the tool, and I believe the BLAST tools output log information that includes at least the version number.
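      The before-and-after *.md5 check suggested above is straightforward to automate. A sketch using only Python's standard library (the file names are illustrative; it assumes the NCBI convention of one <volume>.md5 sidecar file per downloaded archive, with the hex digest as the first whitespace-separated field):

      ```python
      import hashlib
      from pathlib import Path

      def md5sum(path, chunk=1 << 20):
          """Stream the file in 1 MB blocks so multi-GB volumes
          don't exhaust memory."""
          h = hashlib.md5()
          with open(path, "rb") as fh:
              for block in iter(lambda: fh.read(chunk), b""):
                  h.update(block)
          return h.hexdigest()

      def verify_volumes(db_dir):
          """Check every downloaded volume against its .md5 sidecar.

          Returns (ok, failed) lists of volume names. A mismatch after
          a lengthy download would indicate a missing volume or a
          mixed old/new dataset."""
          ok, failed = [], []
          for sidecar in sorted(Path(db_dir).glob("*.md5")):
              volume = sidecar.with_suffix("")      # strip ".md5"
              expected = sidecar.read_text().split()[0]
              if volume.exists() and md5sum(volume) == expected:
                  ok.append(volume.name)
              else:
                  failed.append(volume.name)
          return ok, failed
      ```

      Re-fetching the small .md5 files after the download and re-running such a check would catch a dataset that changed mid-download.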
      I am worried that the dataset description entered manually by the system administrator into /galaxy-central/tool-data/blastdb.loc and friends contains an element of "manual versioning", as the example contains:

      ```
      nt_02_Dec_2009    nt 02 Dec 2009    /depot/data2/galaxy/blastdb/nt/nt.chunk
      wgs_30_Nov_2009   wgs 30 Nov 2009   /depot/data2/galaxy/blastdb/wgs/wgs.chunk
      ```

      This sounds very error-prone and, as older datasets are not available from NCBI, definitely not reproducible. I would expect the article to at least acknowledge these concerns, and ideally for the tooling to support this (e.g. through the BLAST Data Manager and additional provenance output from the BLAST+ tools, e.g. in W3C PROV format).

      5: Results workflows unavailable

      "We now describe some use-cases and workflows combining these tools within Galaxy." The first two examples:

      • Assessing a de novo assembly
      • Finding genes of interest in a de novo assembly

      do not link to any actual Galaxy workflow descriptions, but are only described as bullet point lists. The descriptions do not link to any examples for "Upload ** sequence" or to the expected outputs. "Identifying candidate genes clusters" is described in more detail, but the workflow is only included as a visual figure, not in the supporting data or uploaded/linked to an external repository like the mentioned myExperiment. The citation for this workflow, [22] http://dx.doi.org/10.1021/ja501630w , is not Open Access, and I was required to use my University of Manchester access to read it. The cited article does not mention the word "workflow" once, and does not seem to contain any data citations for the workflow, only for the sequence. The only supporting information provided at http://pubs.acs.org/doi/suppl/10.1021/ja501630w is a PDF with tables, graphs and sequence views; again, the word "workflow" is not mentioned. A direct link to the workflow definition should be included for all three examples.

      Discretionary Revisions

      Spelling corrections for product/company names

      p2:

      • MyExperiment -> myExperiment
      • Amazon Inc. -> Amazon AWS
      • "Cloud Computing" -> Cloud Computing
      • Galaxy "CloudMan" -> Galaxy CloudMan
      • "Galaxy Tool Shed" -> Galaxy Tool Shed

      p5:

      • "such FASTA format" -> "such as FASTA format"
      • Docker Inc. -> Docker Inc. (https://www.docker.com/)
      • Galaxy "CloudMan" -> Galaxy CloudMan

      Useful hyperlinks:

      • whose functional tests are then run. -> whose ..then run (https://travis-ci.org/peterjc/galaxy_blast).
      • The Galaxy-P project -> The Galaxy-P project https://usegalaxyp.org/
    4. Now published in GigaScience doi: 10.1186/s13742-015-0080-7

      Peter J. A. Cock (Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK; correspondence: peter.cock@hutton.ac.uk), James E. Johnson (Minnesota Supercomputing Institute, University of Minnesota, 599 Walter Library, 117 Pleasant St. SE, 55455, Minneapolis, Minnesota, USA), Nicola Soranzo (CRS4, Loc. Piscina Manna, 09010 Pula (CA), Italy)

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1186/s13742-015-0080-7), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

    1. Now published in GigaScience doi: 10.1186/s13742-015-0101-6

      Judith Risse (1), Marian Thomson (1), Garry Blakely (2), Georgios Koutsovoulos (3), Mark Blaxter (1,3), Mick Watson (1,4)

      1 Edinburgh Genomics, School of Biological Sciences, The King’s Buildings, The University of Edinburgh, EH9 3FL
      2 Institute of Cell Biology, School of Biological Sciences, The King’s Buildings, The University of Edinburgh, EH9 3BF
      3 Institute of Evolutionary Biology, School of Biological Sciences, The King’s Buildings, The University of Edinburgh, EH9 3FL
      4 The Roslin Institute, University of Edinburgh, Easter Bush, EH25 9RG

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1186/s13742-015-0101-6 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.100346 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.100344

    1. Now published in GigaScience doi: 10.1093/gigascience/giab060

      David Johnson (1,2), Alejandra Gonzalez-Beltran (1,5), Kenneth Haug (3,6), Massimiliano Izzo (1), Martin Larralde (7), Thomas N. Lawson (8), Alice Minotto (4), Pablo Moreno (3), Venkata Chandrasekhar Nainala (3), Claire O’Donovan (3), Luca Pireddu (9), Pierrick Roger (10), Felix Shaw (4), Christoph Steinbeck (11), Ralf J. M. Weber (8,12), Susanna-Assunta Sansone (1), Philippe Rocca-Serra (1)

      1 Oxford e-Research Centre, Department of Engineering Science, University of Oxford, 7 Keble Road, OX1 3QG, Oxford, United Kingdom
      2 Department of Informatics and Media, Uppsala University, Box 513, 751 20 Uppsala, Sweden
      3 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
      4 Earlham Institute, Norwich Research Park, Norwich NR4 7UZ, United Kingdom
      5 Science and Technology Facilities Council, Scientific Computing Department, Rutherford Appleton Laboratory, Harwell Campus, Didcot, OX11 0QX, United Kingdom
      6 Genome Research Limited, Wellcome Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Saffron Walden CB10 1RQ, United Kingdom
      7 Structural and Computational Biology Unit, European Molecular Biology Laboratory (EMBL), Meyerhofstraße 1, 69117 Heidelberg, Germany
      8 School of Biosciences, University of Birmingham, Edgbaston, Birmingham, B15 2TT, United Kingdom
      9 Distributed Computing Group, CRS4: Center for Advanced Studies, Research & Development in Sardinia, Pula, Italy
      10 CEA, LIST, Laboratory for Data Analysis and Systems’ Intelligence, MetaboHUB, Gif-Sur-Yvette F-91191, France
      11 Cheminformatics and Computational Metabolomics, Institute for Analytical Chemistry, Lessingstr. 8, 07743 Jena, Germany
      12 Phenome Centre Birmingham, University of Birmingham, Edgbaston, Birmingham, B15 2TT, United Kingdom

      Correspondence: susanna-assunta.sansone@oerc.ox.ac.uk, philippe.rocca-serra@oerc.ox.ac.uk

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab060), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: Kevin Menden

      In the paper "ISA API: An open platform for interoperable life science experimental metadata", Johnson et al. present an extensive Python API for reading, writing and handling metadata in the ISA format. The authors describe the increasing use of the ISA formats and thus indicate the need for better tools to handle such data. The article is well written and easy to understand. The ISA tools package contains extensive functionality and solid documentation. Furthermore, it can be installed with PyPI and Bioconda, which I think should be standard nowadays. The authors furthermore provide a Docker image, which is nice. All in all, I think the ISA tools package is a genuinely useful piece of software that is well written, which is why I recommend this manuscript for publication in GigaScience.

      However, a few minor things should be changed. Personally, I would like to know whether support for upload to additional databases will be added in the future; this could be noted in the text. The article contains many figures with only little content; I would strongly advise merging some of these into a smaller subset of figures to improve readability. The authors spend a considerable amount of text on download statistics, something that in my opinion is not really that relevant for the software package, and I would recommend considerably shortening this section. On a similar note, the methods section basically just describes how these download statistics were handled. Considering this article describes a software package, it might be more useful to the reader (and reviewer) to elaborate a bit on how the software is written, maintained, structured and tested, and related things.

      Reviewer 2: Manuel Holtgrewe

      The authors describe the Python library "isatools" for accessing ISA (investigation study assay) files in ISA-Tab and ISA-JSON format. The authors start by describing their previous work around the ISA data model and file formats in detail. They then describe their implementation and the features of their API. They highlight the extensibility and efficiency of their object-oriented model. They describe in detail how metadata can be curated in ontologies, and that extensions are currently underway for the assisted creation of study metadata. They then refer to early adopters and a stable and growing community. They conclude with the statement that their library is "a major step forward in making the ISA framework open and interoperable".

      General Remarks

      Overall, we have found the ISA data model and ISA-Tab data format to be very useful in our own work. However, there are some issues with the software, including apparent bugs, as described below. In 2018, my colleagues and I considered using the ISA-API in our project for ISA-Tab parsing, but the problems and the lack of automated tests made us roll our own (also see below). Overall, the authors make a clear point and the paper is well written. However, the software appears to be unfinished and some work is required to make it suitable for publication.

      Major Issues

      1. ISA-creator and Bio-GraphIIn are cited as having "helped grow the ISA community of users". The authors should offer evidence for this, as (a) in our own experience ISA-creator is very hard to use, which is also reflected in the opinions expressed by everyone I have met so far who has used it, and (b) it is not possible to validate how Bio-GraphIIn has helped grow the community, as the website linked to in the cited article is no longer available and no source code is available, e.g., on GitHub. The Google Groups forum has fewer than 10 threads per year, with 2 in 2020 so far and one in 2019. The authors should balance these counts against their PyPI download statistics.
      2. The authors should cite other published APIs for ISA file formats, e.g., AltamISA:
      • Kuhring, Mathias, et al. "AltamISA: a Python API for ISA-Tab files." Journal of Open Source Software 4.40 (2019): 1610.
      3. The authors should show proof of the "efficiency" of their object-oriented model, e.g., by comparing import efficiency with that of AltamISA. I am raising this point as some users have raised questions about efficiency when loading/writing data files in the ISA-API GitHub issues.
      4. The authors write that development is in progress, but it appears from the GitHub code frequency graph that development has mostly stalled since 2018.
      5. The authors should explain in more detail how stable their API is and what the limitations and assumptions are. In my opinion, one important point in data import and export is how the data looks after a "round trip", e.g., import of ISA-Tab followed by export of ISA-Tab. I have done this on the official ISA data sets (https://github.com/ISA-tools/ISAdatasets, commit f20be4f83dc5f6f7ec419bfd634efba3177e4ae4). Here are the (to me) unexpected results for the official example data: (a) on BII-I-1, whole columns disappear, such as the first "Material Type" column; (b) all other datasets fail to parse, and parsing crashes with Python exceptions. I think the authors should work on these points. It cannot be judged whether the software can be published at this point; the software appears unfinished and some more work has to go into it to allow for publication.

      Minor Issues

      1. The authors should provide more automated tests for their software. In 2018, when we tried out the package, we found some inconsistencies and problems but found it hard to fix bugs in the large body of software because of the lack of comprehensive automated tests.
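      The column-loss symptom described in the round-trip test can be detected mechanically without any ISA-specific tooling. The sketch below compares the header row of a study/assay table before and after a load/dump cycle; the `column_diff` helper and file paths are my own illustration (the actual isatab.load / isatab.dump calls from the ISA-API are assumed to produce the round-tripped file and are not shown):

      ```python
      import csv

      def column_diff(original_path, roundtripped_path):
          """Report header columns lost or gained by a load -> dump round trip.

          ISA-Tab study/assay files are tab-separated and headers may
          repeat (e.g. two 'Material Type' columns), so occurrences are
          counted rather than collected into a set."""
          def header_counts(path):
              with open(path, newline="") as fh:
                  header = next(csv.reader(fh, delimiter="\t"))
              counts = {}
              for col in header:
                  counts[col] = counts.get(col, 0) + 1
              return counts

          before = header_counts(original_path)
          after = header_counts(roundtripped_path)
          lost = {c: n - after.get(c, 0) for c, n in before.items()
                  if n > after.get(c, 0)}
          gained = {c: n - before.get(c, 0) for c, n in after.items()
                    if n > before.get(c, 0)}
          return lost, gained
      ```

      Run over the BII-I-1 example, such a check would flag the disappearing first "Material Type" column as a lost occurrence, making the round-trip regression easy to assert in an automated test suite.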

      Reviewer 3: Chris Hunter

      The manuscript is well written and coherent; it provides a nice balance of the historical context of ISA-Tools and the current release of the ISA-API. As a biologist and biocurator I can attest to the importance of simple-to-use tools for curation of datasets, and the ISA-creator has been well used by the community over the years. The addition of the ISA-API should allow more repositories to incorporate the use of ISA formats as both import and export formats for datasets. I have to admit that my lack of experience as a developer means that I am in no position to actually test the API's functionality, so I cannot comment on the technical suitability of the implementation, or even on whether it works or not! I have been asked to review the manuscript with specific reference to the original reviewer 2 concerns.

      Reviewer 2 comment 1: "The ISA-creator and Bio-GraphIIn are cited as "helped grow the ISA community of users". The authors should offer evidence for this as; (a) by our own experience ISA-creator is very hard to use and this is also reflected by the expressed opinion on ISA-creator by anyone I have met so far who has used it and (b) it is not possible to validate how Bio-GraphII has helped grow the community as the website linked to in the cited article is not available anymore and no source code is available, e.g., on GitHub. The Google groups forum has less than 10 threads per year, with 2 in 2020 so far and one in 2019. The authors should balance these counts with their "PyPi" download counts statistics."

      My comment: I believe the authors have addressed the primary concern about evidence of continued growth in the ISA user community with the detailed description of the PyPI download statistics. The reviewer's point about the ISA-creator user experience, and the anecdotal comment about all who have used it, is unfounded; in fact, if true, it adds to the argument for implementing the ISA-API as a means to allow a wider developer base to improve the ISA-creation experience.

      Reviewer 2 comment 2: "The authors should cite other published APIs for ISA file formats, e.g., AltamISA."

      • Kuhring, Mathias, et al. "AltamISA: a Python API for ISA-Tab files." Journal of Open Source Software 4.40 (2019): 1610. My comment: The authors have made appropriate changes and included the suggested reference. Reviewer 2 comment 3. "The authors should show proof for "efficiency" of their object-oriented model, e.g., by comparing import efficiency with that of AltamISA. I'm raising this point as some users raised questions on efficiency when loading/writing data files in the ISA-API GitHub Issues." My comment: The authors have replaced the word efficiency with coherent in the manuscript to clarify the meaning in the relevant paragraph. However I'm not sure they have addressed the principle of the concern raised by reviewer 2, i.e. how does the ISA-API compare to other existing models in terms of efficiency? As I have no idea how to measure "efficiency" of a model I'm not convinced this is a valid request from reviewer 2. Reviewer 2 comment 4. "The authors write that development is in progress but it appears from the GitHub code frequency graph that development has mostly stalled since 2018." My comment: I agree with the authors rebuttal of this point, simply looking at GitHub commits is not a suitable measure. Reviewer 2 comment 5. "The authors should explain in more detail how stable their API is and what the limitations and assumptions are. In my opinion, one important point in data import and export is looking how data looks after a "round-trip", e.g., import ISA-Tab, followed by export ISA-Tab. I have done this on the official ISA data sets (https://github.com/ISA-tools/ISAdatasets, commit f20be4f83dc5f6f7ec419bfd634efba3177e4ae4). Here are the (to me unexpected results for official example data): (a) On BII-I-1, whole columns disappear such as the first "Material Type" column, (b) All other datasets fail to parse and parsing crashes with Python exceptions." 
My comment: Unfortunately my lack of the required skill set to make any sort of tests myself means I am not in a position to adjudicate on this point! I do agree with the authors rebuttal that they cannot assess the reviewers issues based on the minimal information provided in the review. As the authors point out, documentation can always be improved, and 1 such improvement might be to include a "round-trip" example as the reviewer 2 has attempted to show that one can take a valid ISA formatted input, convert it to say SRA format, and back to ISA format using the API and that the input and output ISA formats do indeed match. Reviewer 2 comment Minor issue 1. "The authors should provide more automated tests for their software. In 2018 when we tried out the package we found some inconsistencies and problems but found it hard to fix bugs in the large body of software because of the lack of comprehensive automated tests." My comment: I think this reviewers comment is un-related to the review, they are talking about a version of the tool that is approximately 3 years old, not the current version that they are meant to be reviewing. Despite the irrelevance, the authors have responded by adding text to highlight the Test Driven Development approach taken in the project. With the one caveat already mentioned, i.e. I am unable to actually test the code so I am reliant on the other reviewer to have covered that aspect of the review, I believe the manuscript is suitable for publication as the authors have adequately addressed all of the reviewer 2 comments with the possible exception of improved documentation.
    1. Now published in GigaScience doi: 10.1186/s13742-015-0046-9

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1186/s13742-015-0046-9 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.100015 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.100017

    1. Now published in GigaScience doi: 10.1186/2047-217X-3-22

      Joshua Quick (1: Institute of Microbiology and Infection, University of Birmingham, Birmingham, UK; 2: NIHR Surgical Reconstruction and Microbiology Research Centre, University of Birmingham, Birmingham, UK), Aaron Quinlan (3: University of Virginia, Virginia, US), Nicholas J Loman (1: Institute of Microbiology and Infection, University of Birmingham, Birmingham, UK). For correspondence: n.j.loman@bham.ac.uk

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1186/2047-217X-3-22), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.100173 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.100172

    1. Abstract

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1186/2047-217X-3-3), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.100110 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.100109

  2. Dec 2021
    1. Although

      Reviewer 1. Aboozar Soorni

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages? No

      Are there (ideally real world) examples demonstrating use of the software? No

      Is automated testing used or are there manual steps described so that the functionality of the software can be verified? No

      Recommendation: Accept

      Reviewer 2. Weiwen Wang

      This manuscript is easy to read, with good writing and detailed methods. By comparing long-read, short-read and hybrid assemblies, this manuscript identifies the best approach for assembling the plastid and mitochondrial genomes. Additionally, the authors considered the multiple structures of the plastid and mitochondrial genomes, providing a new and carefully designed method to assemble and assess the complex mitochondrial genome. While this manuscript represents solid work and was interesting to read, I have some minor concerns which should be fixed to improve its quality.

      Lines 240-247: It might be easier to understand if the authors numbered each contig. For example, "The assembly graph suggests the typical quadripartite structure of a LSC (contig 1-7) as the larger circle in the graph…". In some figures the authors numbered the contigs, but in others they did not. Also, in Figure 2, why does the SSC region also have almost 2x coverage (1.88x)?

      Lines 253-258: In Figure 3, it is three contigs (92 kb, 38 kb and 5 kb), rather than two contigs (81 kb and 92 kb), that the manuscript describes. I guess the authors included the wrong figure?

      Lines 283-285: This is a smart method to clearly show the assembly details.

      Lines 385-387: Does this mean that the black segment (edge 11) in Figure 12 consists of two highly similar (or identical) regions? Have the authors tried a simple BLAST to confirm this?

    2. Abstract

      A version of this preprint has been published in the Open Access journal GigaByte (see paper https://doi.org/10.46471/gigabyte.36), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

    1. Abstract

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab081), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1. Qi Zhou

      Mueller et al. present here a high-quality genome and annotation of the tufted duck, produced with combined long-read and short-read techniques. The tufted duck shows a different susceptibility to avian influenza A viruses (AIV) compared to mallards that share its habitat. So besides adding a new avian genomic resource, the tufted duck genome may facilitate research into the genetic basis of AIV infection. Overall, I think the genome is of high quality, but I do have several comments below:

      The introduction is largely devoted to the great advantage of PacBio over Illumina techniques in elucidating the genome features of non-model species. This is not needed for the readership of GigaScience. I suggest the authors provide more information on the tufted duck. From previous studies, how diverged are the tufted duck and the mallard, in terms of millions of years and at the sequence level? What is the phylogenetic position of the tufted duck within Galloanserae? Are there any lab or field studies of the tufted duck's susceptibility to AIV? What is the potential genetic cause? Also, since it is known in mallards that RIG-I is responsible for the AIV response, is this gene intact, and how is it expressed, in the newly presented tufted duck genome?

      The analyses part needs to show the repeat content of the tufted duck genome and its comparison to other avian genomes, particularly the repeat content of the Z and W chromosomes. Did the authors look into the centromere or telomere sequences? Are the Z and W chromosomes assembled into two intact sequences? If so, evidence is needed to show that there is no chimeric assembly between Z and W, or with other autosomal sequences, as it is mentioned in the paper that 'most of the genome separated into haplotypes'.

      Tissue-specific expression part: What exactly does a 'supported' gene mean here? Just to make sure, do the authors mean 'genes' or 'transcripts'? 'Stringtie2 may discard single-exon transcript model..': did the authors find that the Stringtie2 results generally have a much lower proportion of single-exon transcripts compared to, say, the Iso-Seq data? 'The average number of transcripts in the long-read pipeline often almost matched..': I am confused here, as Figure 5 shows the opposite result, that the supported PacBio genes are much lower in number than those produced by Illumina reads. Are there results to support this claim? 'This distribution is much more balanced in the long-read pipeline..': here the authors may be suggesting that PacBio Iso-Seq recovers more alternative-splicing transcripts than Illumina data. But it is unclear why the supported genes from Iso-Seq are so much lower in number than those from Illumina; could this be caused by the relatively lower coverage of Iso-Seq? I would conclude that, at least in this study, the two techniques are complementary to each other, rather than one performing better than the other.

      How are the TEs annotated by these small RNAs? Apart from miRNAs, there should be large portions of piRNAs mapping to TEs.

      Re-review: For question 3: The authors need to explain more about why they think Figure S1 shows there is no chimeric assembly between the Z and W chromosomes, as Figure S1 is just a Hi-C matrix plot. Among the submitted materials, I also cannot find a legend explaining the figure.

    2. Background

      Reviewer 2. Joshua Peñalba

      This data note by Mueller et al. describes the high-quality, chromosome-scale assembly of the tufted duck and the gene annotation using both Illumina and PacBio sequencing. The authors present and compare the resulting annotations from the different sequencing platforms, which is useful for researchers intending to do RNA-seq for annotation.

      I think the details in the note focus primarily on the gene annotation comparison, and the genome assembly has not received adequate attention. Since this will likely be the data note that reports the genome assembly, it should probably have more details on the chromosomes (detailed below). I understand that one can go into an exhaustive description, but I think the following, as a minimum, will give the reader a good idea of the quality of the genome assembly:

      Are the 34 autosomes + sex chromosomes expected? Was there an a priori expectation based on the karyotype or on the mallard genome assembly? Was this expectation provided during scaffolding with HiC? Does the assembly size match the expected genome size based on an independent estimate?

      Since this is a chromosome-scale assembly, what are the metrics of the individual chromosomes? I see in NCBI that chromosome numbers have been assigned; is this based on size or on homology to chicken chromosomes? What are the lengths of each chromosome? GC content? Gene content? Gaps? How many contigs were scaffolded by HiC to build each chromosome?

      More detail is needed on how the manual curation was done, so that it can be repeated by other researchers.

      What was the sequencing effort (# lanes, # SMRT cells, etc.) and the resulting coverage of the genome from each technology? This doesn't have to be very detailed if it is reported elsewhere, but some idea will be helpful to the reader. I am aware that the VGP pipeline was used for the assembly; is there a GitHub repository for the pipeline, with the specific commands and flags for each step, that can be included?

      Since the annotation comparison was exhaustive, I don't have as many comments on it. The authors may not be explicitly recommending which approach to use, but what is a good metric that readers can use to compare the results, considering the orders-of-magnitude difference in sequencing coverage?

      Regarding the Illumina and PacBio annotation comparisons: since the coverages are substantially different, in what metric are they comparable? Are they similar in sequencing cost? Would the PacBio data still underperform in terms of recovered genes if it had the same coverage as the Illumina libraries? What was the sequencing effort for the PacBio Iso-Seq?

    1. Abstract

      This paper has been published by GigaByte, which openly shares its peer reviews under a CC-BY4.0 licence.

      Reviewer 1. Qiye Li

      Are all data available and do they match the descriptions in the paper? No. The availability of the raw sequencing data generated in this study is not stated, and it would be appreciated if the authors could provide a table summarizing all the sequencing data generated in this study.

      Is the data acquisition clear, complete and methodologically sound? Yes. Could you also provide the gender information for woy03?

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No

      L145-146: It is unclear how the authors determined full-length protein-coding genes by BLAST against the Swiss-Prot non-redundant database. It would be appreciated if the authors could provide more details here.

      L183: The authors indicated that 15,904 of the 24,655 protein-coding genes were supported by mRNA evidence and 1,309 by protein evidence. Does the mRNA evidence come from the RNA-seq data? Where does the protein evidence come from?

      Is there sufficient data validation and statistical analyses of data quality? No

      L233: Contaminating sequences in the reference genome are noteworthy, as the DNA for genome sequencing was extracted from wild animals that were dead before sampling. However, I would say that high mapping rates do not necessarily represent low contaminating DNA, as the contaminating DNA (e.g. from bacteria), if it exists in your dataset, might have been assembled as part of the woylie reference genome. It is unclear if the authors have submitted the genome to NCBI. If so, I think they should have received a report about contamination from NCBI.

      It would be appreciated if the authors could provide some more statistics for protein-coding genes (e.g. Mean gene size, Mean exon number per gene, Mean exon length and Mean intron length) and compare these metrics to other marsupials. This will be helpful to judge the quality of the gene models.

      Is the validation suitable for this type of data? Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Recommendation: Minor Revision

      Reviewer 2. Parwinder Kaur Well presented document with good data and analyses practices.

      Recommendation: Accept

      Reviewer 3: Walter Wolfsberger

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      The submission body and Table 1 state the following statistics for the genome assembly, which seem to indicate some potential issues: genome size 3.39 Gb; 1,116 scaffolds; 3,016 contigs; scaffold N50 6.94 Mb; contig N50 1.99 Mb.

      The main issue here for me lies in Scaffold N50 in relation to other parameters, when in comparison with the assemblies using similar methodological approach.

      This can either be good or bad, as these numbers might indicate an issue during scaffolding, or the presence of long top assembly scaffolds (which is great). I believe that the submission would benefit significantly if this information were mentioned and discussed.

      The approach used to generate the assembly seems to utilize 10x PE sequences to scaffold the assembly. There are hybrid assembly approaches available that leverage short reads to improve assembly quality, given the slightly limited coverage of the PacBio HiFi reads (approx. 12x).

      Recommendation: Minor Revision

      Re-review: The authors addressed all my assembly-related comments in sufficient manner and provided updates that will benefit the manuscript and data released with it.

  3. Nov 2021
    1. The mule deer

      Reviewer 2. Dr. Rebecca Taylor

      Are all data available and do they match the descriptions in the paper? Yes. I checked the two links included in text as well as the NCBI data availability and all data is available for download with good explanations as to what all the files are.

      Are the data and metadata consistent with relevant minimum information or reporting standards? No. On the whole the information included for the data and metadata is good, but some more information about the sample used would be beneficial. It is stated that the sample came from 'Woodland Hills, Utah'. I assume this is in the United States? Some more information about the environment would be beneficial for those unfamiliar with the area. It is also not stated which subspecies the sample belongs to. A map of the species range and where this sample is from could also help. Additionally, in the 'Background and context' section, you state that 'genetic resources available for Odocoileus spp. are limited to a variety of microsatellite loci' with the exception of Russell et al. I think you need to do a more thorough search – for example, I have used a Sitka deer genome (Odocoileus hemionus sitkensis) as an outgroup for my work, sequenced by the CanSeq150 program, found on the NCBI under Bioproject PRJNA476345.

      Is the data acquisition clear, complete and methodologically sound? No.

      I was a bit confused by the sentence 'The assembled mule deer genome has a total length of 2.61 Gbp with a GC content of 41.8% and a contig N50 of 28.6 Mbp (Table 1) with a longest contig of roughly 96.5 Mbs' occurring before the 'Chromosome-length Scaffolding' section. Are these assembly statistics for the version of the genome before chromosome scaffolding? It would be better to report the final assembly statistics for the chromosome-scale assembly, or, if these are the final statistics, to move them until after the 'Chromosome-length Scaffolding' section. For Table 1, it might be nice to include the L/N90; the most standard metrics are the N/L50 and N/L90, so you don't necessarily need all of the others. Additionally, I could not find where the number of 'chromosomes' in the final assembly is stated. Is this a chromosome assembly or (more likely) a chromosome-scale assembly (i.e. there are also other scaffolds not included in the main 'chromosomes')? A relevant, newly published paper might be good to reference: Yamaguchi et al. Technical considerations in Hi-C scaffolding and evaluation of chromosome-scale genome assemblies. Molecular Ecology.

      I also think it would be beneficial to know the coverage of the files you used for the PSMC analysis, making sure to filter out sites above ~double the average (I can see that you filtered for a maximum of 90X, but I don't know whether this is appropriate or not). A citation for the generation time used would also be beneficial, as this strongly influences PSMC results. It would also be good to explain the recent rise in effective population size seen in the white-tailed deer. Is this a real pattern, or is it because PSMC can be spurious at more recent times? This is another reason why it would be good to know the depth of both files used here. Could the different demographic histories be caused by competition between the species? I am not an expert on these species, but I was just curious given their contrasting demographic histories.

      Is there sufficient data validation and statistical analyses of data quality? Not my area of expertise

      Is there sufficient information for others to reuse this dataset or integrate it with other data? No. As I stated above, information about which subspecies this individual is from in text would be beneficial.

      Recommendation: Minor revision

    2. ABSTRACT

      This paper has been published in GigaByte Journal (https://doi.org/10.46471/gigabyte.34), where it and the open peer reviews are published under a CC-BY 4.0 license.

      Reviewer 1. Dr. Endre Barta. See comments in additional submitted file. Recommendation: Minor revision

    1. Mycobacterium avium subsp

      Reviewer 2. Dr. Nabeeh A. Hasan

      Is the language of sufficient quality? Yes. A few minor grammatical edits could be made.

      Are all data available and do they match the descriptions in the paper? No. The data are not currently accessible by the public in NCBI.

      [The curators will be in touch to make sure all the data is live - see GigaDB guidelines on the data they require http://gigadb.org/site/guide]

    2. Abstract

      This article has been published open access in GigaByte Journal (https://doi.org/10.46471/gigabyte.33), which also publishes open peer reviews under a CC-BY4.0 license.

      Reviewer 1. Dr. Astrid Lewin

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Chapter „Methods, b) Bacterial isolation and DNA extraction”: Lines 139-140: There is a discrepancy between the method of DNA extraction as described in the reference (16) and the manuscript text. While according to the reference the bacterial pellet is dissolved in acetone, the manuscript text describes a treatment with chloroform and methanol. This should be clarified.

      Is there sufficient data validation and statistical analyses of data quality? Not my area of expertise

      Any Additional Overall Comments to the Author:

      Chapter “Data Validation and quality control, Identification of MAH”, lines 218-220: It is true that the isolates had the highest identity with one of the three MAH strains, but not with all three MAH reference strains. For example, isolate OCU468 has 98.69% identity with MAH TH135 but 98.79% identity with MAP K-10. The degree of identity seems to be highly dependent on the choice of strains, so this comparison may not be very significant. In my experience, growth at 42°C distinguishes MAH very well from the other M. avium subspecies.

      Recommendation: Accept

    1. Abstract

      This paper has been published by GigaScience (doi:10.1093/gigascience/giab076) which publishes the peer reviews openly under a CC-BY license.

      Reviewer 1. Ayush Dogra

      An overview of the National COVID-19 Chest Imaging Database: data quality and cohort analysis.

      Comments:
      1) The abstract is not very convincing or informative. Please refine.
      2) What is the motivation of this work? Please include it in the manuscript.
      3) The authors could provide a more appealing block diagram for Figure 1.
      4) The Inclusion Criteria section is a bit ambiguous. How were these criteria decided? Justify.
      5) How is your manuscript different from other manuscripts? Kindly state this in the manuscript.
      6) Refine the discussion part.
      7) There are a few linguistic and grammatical errors. Please correct them.
      8) The similarity index must be less than 10 percent.

      Reviewer 2. Chris Armit This excellent Data Note provides an overview of the National COVID-19 Chest Imaging Database (NCCID), which is a centralised repository that hosts DICOM format radiological imaging data relating to COVID-19. By the very nature of this resource these data have immense reuse potential. The NCCID is the first national initiative of its kind - led by NHSX, British Society of Thoracic Imaging, and the Royal Surrey NHS Trust and Faculty - and the database hosts approximately 20,000 thoracic imaging studies related to SARS-CoV2 admissions from 20 NHS Hospitals / Trusts across England and Wales. Of note, the NCCID is additionally registered on the Health Data Research UK platform, with a platinum metadata rating which is a commendable achievement.

      As part of this review, I used the NCCID Data Access Agreement, NCCID Data Access Framework Contract, and NCCID Application Form to gain access to the NCCID Project WorkSpace. This WorkSpace utilises the very powerful and highly intuitive faculty.ai platform to run Jupyter Notebooks on a remote server where the NCCID data can be accessed. I was impressed that the faculty.ai platform allows very many different views of the NCCID data, for example one option was to view the data by Scanner Type. This is an important consideration from a deep learning reuse perspective as it is known that different X-ray / CT scanners can introduce different artefacts, and this can confound multisite analysis (for example see Badgeley et al., 2019, https://doi.org/10.1038/s41746-019-0105-1). I find that the NCCID organising the imaging data in this way is particularly helpful for addressing this issue.

      I was additionally impressed that the NHS Analytics Unit was willing to provide an Onboarding Session to help a naïve user navigate the faculty.ai platform more effectively, and to provide one-on-one tuition on how the interface can be used for image analysis. I used this session to explore the functionality of the DICOM viewer that can be used to preview NCCID thoracic images. A Javascript viewer enables a user to open DICOM images and explore the image histogram of intensity values and I see this as a useful means of assessing, for example, contrast stretching in radiological image data that has been submitted to NCCID. As a follow-up to this Onboarding Session, there is now the additional option to launch a static viewer that offers a higher quality preview image of NCCID DICOM data. I find this functionality exceptionally helpful as it enables an end-user to preview image data and to visually inspect, for example, glassy nodules in COVID-19 thoracic image data prior to data download. I thank the NHS Analytics Unit for further developing the image visualisation capabilities of the NCCID Project WorkSpace as part of this review process. On this note I wish to highlight that, of the two viewers, I found the static viewer particularly helpful for assessing image quality of CT scans which was excellent.

      I was further impressed that the thoracic imaging data includes a positive cohort with COVID-19, but also a negative cohort consisting of individuals with a negative swab test, but who may have a different underlying respiratory condition. This is an important consideration and it enables this dataset to be used for machine learning and deep learning approaches that could be used to distinguish between COVID-19 and other respiratory conditions in what remains a clinically relevant challenge.

      Importantly, the code for the NCCID data warehouse and the Data Cleaning pipeline utilised in the paper are Open Source and available on GitHub (https://github.com/nhsx/covid-chest-imaging-database ; https://github.com/nhsx/nccid-cleaning) where they have been ascribed OSI-approved MIT licenses.

      This is an excellent Data Note and I recommend this manuscript for publication in GigaScience.

      Minor comments

      1. The MTA is tailored towards breast cancer screening. For example, there are the following definitions: "Source Database" means the assembled collection of images collated from the research project entitled 'OPTIMAM: Optimisation of breast cancer detection using digital X-ray technology'. "Related Data" means any and all pathological and clinical data associated with the Database Images supplied by or on behalf of CRT or Surrey to Company under this Agreement, in particular but without limitation, this may be identified regions of interest in the Database Images, the age of the woman at the date the relevant Database Image was taken, details about previous screening events, patient history, X-ray, ultrasound assessment, details of biopsy procedures and surgical events - all in a structured format representative in structure, format, quality, content and diversity of the Source Database.

      Can the authors please confirm that this MTA is suitable for thoracic radiology in the mixed sex COVID-19 study outlined in the accompanying preprint?

      2. In support of the manuscript, I further recommend that a copy of the NCCID Data Access Agreement, Data Access Framework Contract, Application Form, and snapshots of the code (GitHub archives) be archived in the GigaScience DataBase (GigaDB).
  4. Oct 2021
    1. Background

      Reviewer 2. Hugo Schnack

      This manuscript reports the results of a study that can be split into two parts. For this, it should be noted that the authors consider three categories of quantities. The first category comprises the input data, or 'predictors': (a) variables derived from MRI scans and (b) rich sociodemographic variables. The second category, the 'target variables' as the authors call them, includes: (a) age, (b) fluid intelligence and (c) neuroticism.

      In the first part of the study, predictive models are built using machine learning to predict the target variables from the input variables. The resulting predictions are called 'proxy measures'. For the second stage, a third category of variables is included, the 'real-world health behaviours', such as alcohol use and physical activity. The authors now set out to predict these measures of behaviour from the measures of the second category, either the 'real' ones or the proxies. Thus, the question is: can alcohol use be better predicted by neuroticism determined from a questionnaire, or by the neuroticism proxy derived from MRI and sociodemographics? The main results are presented in Figure 2, and the conclusion drawn by the authors is that the proxies perform better than the real measures. The authors carry out additional analyses, including a study of the relative importance of MRI and sociodemographics, and suggest that these proxies may have clinical use in the future.

      At first sight it may seem surprising that proxies perform better than the real measures in capturing the associations, but, as the authors mention, the real measures suffer from (measurement) noise and non-objectivity. However, the proxies are biased (in the sense of being too simple) and are thus less capable of modeling the (true) individual variation. I would have expected a more in-depth discussion of this.

      Apart from this, there is an asymmetry in the way age is treated compared to the other two target variables, intelligence and neuroticism. Age is a very hard measure, without any measurement error, and independent of the brain. The other two targets, intelligence and neuroticism, are softer measures, and directly related to the brain. How does this influence the analyses and the results? Indeed, not 'predicted age' but 'brain age delta' is used as the proxy. I would have liked to see more explanation and discussion of this.

      Finally, the suggested clinical use of the proxies is, in my opinion, not supported well enough; maybe the authors could add more discussion on this point as well. All in all, this is a scientifically interesting study, but I think the presentation could be improved by stating its aims more clearly and by giving more insight into certain aspects of the 'proxy modeling'.

    2. Abstract

      This paper has been published in GigaScience, where the peer reviews are published openly under a CC-BY 4.0 license.

      Reviewer 1. Bo Cao Reviewer Comments to Author: The manuscript describes an application of Machine Learning (ML) models for the quantification of psychological constructs, e.g. fluid intelligence and neuroticism, using multi-modal MRI data from a large population cohort, the UK Biobank. The authors show that the proxy measures of these psychological constructs are more useful than the original constructs for characterizing health behaviors. Overall, the manuscript is well written. The research questions are clearly stated and are of practical importance. However, the reviewer has the following concerns.

      Major Concerns:<br> 1) On page 3 (left, lines 3-6 of the main text), the author claims that "Our findings suggested that psychological constructs can be approximated from brain images and sociodemographic variables - inputs not tailored to specifically measure these constructs.". The reviewer has concerns about this claim. Although Figure 3 shows the model's performance in predicting age, fluid intelligence and neuroticism using neuroimaging data and different areas of sociodemographic data, the performance of the models in predicting the psychological constructs, fluid intelligence and neuroticism, may not be good enough to support such a claim. 2) In Figure 2, the proxy measure and original measure show similar associations with the health phenotypes for fluid intelligence (center plot) and neuroticism (right plot), but not for the brain age delta. The main reason seems to be that, in the association analysis, the measures of the health phenotypes are de-confounded for their dependence on age (in the subsection "Out-of-sample association between proxy measures and health-related habits" of the "statistical analysis" section). However, it seems the same procedure is not applied in the association analysis of fluid intelligence and neuroticism. The estimated brain age, or brain age gap, depends on age. Thus, we need to either correct the brain age or brain age gap for its dependence on age, or de-confound the health phenotypes' dependence on age. If the author wants to derive the proxy measures of the psychological constructs in the same way as the brain age (or biological age), the same procedure should be used to correct the proxy measures' dependence on the original measures. 3) Based on Figure 2, the author claims that the proxy measures have enhanced associations with health behavior compared to the original measures. If we only focus on the central and right parts of Figure 2, the difference is not that obvious. We do not know whether the difference is significant. A better approach may be to correct the predicted fluid intelligence and predicted neuroticism for their dependence on the original measures, or to de-confound the original measures' effects on the health behaviors.

      Minor concerns: 1) On page 1 (two lines before reference 15), it seems that "to learn" is misspelled as "tolearn".

      2) The author stated that there are repeated measures for subjects in the UK Biobank data. How does the author tackle this issue in the data preprocessing? Do they use the last measurement, the first one, or something else?

      3) 5,587 of the 10,975 subjects are selected for modeling, while the remainder are used for the out-of-sample association analysis. The selection seems arbitrary. Can the author also show a learning curve, in which x is the sample size and y is the model's performance, to justify that this choice is sufficient to train an accurate ML model?

      4) In the first paragraph of the "Methods" section, there are duplications.

      5) In the subsection of the "Data acquisition" part, under the "target measures" paragraph, the age at baseline recruitment is used as the outcome. However, in general, there is a gap between the age at baseline and the age when the MRI images were acquired. Does this matter for the data analysis in this manuscript?

      6) For the classification analysis (paragraph "Classification analysis" in the subsection "Comparing predictive models to approximate target measures", and the paragraph above the "Discussion" section), the thresholds selected to discretize the outcome variables seem somewhat arbitrary.

      Comments on Re-Review: The substantial revision improved the paper and is appreciated by the reviewer. The details have been enhanced. However, the reviewer still has some concerns about the basic logic of the paper and its presentation after reviewing all the comments from other reviewers and the feedback from the authors. Figure 1 is helpful (although its font is too small, smaller than in the other figures). But consider the current approach again: if the machine learning (ML) models had perfect performance in generating the so-called "proxy measures", these measures would exactly match each individual's age, fluid intelligence and neuroticism. What the authors claim about proxy measures providing a better assessment of other health-related variables might therefore simply be due to the imperfection, or the "residuals", of the ML predictions relative to the real targets (age, fluid intelligence and neuroticism). The authors may need to address this and present the logic of the paper more clearly to help readers understand its main point and results. In this regard, Figure 1 is incomplete in addressing the full flow of the paper, which is necessary for such a seemingly complex paper in the reviewer's opinion.

    1. Abstract

      A revised and updated version adapted from this preprint was published on 6th September 2018 in GigaScience, titled:

      Sequana coverage: detection and characterization of genomic variations using running median and mixture models https://doi.org/10.1093/gigascience/giy110

      As GigaScience is an open-access, open-peer-review journal, the peer reviews of this paper are available here:

      Review 1. http://dx.doi.org/10.5524/REVIEW.101353 Review 2. http://dx.doi.org/10.5524/REVIEW.101350 Review 3. http://dx.doi.org/10.5524/REVIEW.101351

    1. The Bicolor Angelfish

      Reviewer 2. Ole K. Tørresen Is the language of sufficient quality?<br> No. Almost every second sentence in the abstract would need work, as does the rest of the manuscript.

      General comments: The authors have created a chromosome-level genome assembly of bicolor angelfish using stLFR and HiC libraries.

      The language in this manuscript needs some work. After commenting on every second sentence in the abstract regarding some language matter, I realised I could not keep commenting on all of these issues. Please do a thorough clean-up of the language so that it is easier to read. I will point out some issues throughout the manuscript, but I will not find them all, and I cannot manage to point out all that I do find.

      Specific comments: Line 19: «...special and beautiful two-color body” is a bit subjective. Maybe something like “…remarkable and striking two-color body” instead?

      Line 20: I know this is the abstract, but I don’t understand what “the mechanism of bicolor body” could mean. Maybe rephrase?

      Line 22: I’ve seen this many places, but it should be a lower-case k in kb, not upper case like Kb. The k stands for kilo which is a metric prefix meaning thousand (https://en.wikipedia.org/wiki/Metric_prefix).

      Line 25: “As we are known,” should be “as far as we know”.

      Line 27: “Future research” instead of “future researches”.

      Line 46: Which protocol are you talking about?

      Table 1: How can you end up with more “valid data” than “raw data”? Did you mix up something here? It looks consistent with the text, but there’s likely something wrong.

      Recommendation Minor Revision

    2. Abstract

      This paper has been reviewed by GigaByte Journal and all peer reviews are shared CC-BY 4.0.

      Reviewer 1. Claudius F Kratochwil. Is the language of sufficient quality?<br> No. The text is understandable, but has many grammatical errors. The manuscript would greatly improve through language editing.

      Are all data available and do they match the descriptions in the paper?<br> Yes. I did not check every single file, but all data I looked for I found to be publicly available. It would help if the "Availability of supporting data and materials" statement were a bit more comprehensive: e.g., "Data A is deposited under X; Data B is deposited under Y-Z", instead of just providing the project ID.

      Are the data and metadata consistent with relevant minimum information or reporting standards? Yes. To the best of my knowledge. Not my area of expertise.

      Is the data acquisition clear, complete and methodologically sound?<br> No. I was lacking information about the transcriptomic data (it says in line 44 that RNA was extracted) that was used for the annotation. Was RNA extracted only from the muscle? Maybe the caveats that go along with that should be discussed. How was the data processed? How many reads, etc.? I think the manuscript lacks information about this, unless I misunderstood where the data for the "transcript-based prediction" came from; in that case this should be indicated more clearly.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?<br> Yes. To the best of my knowledge. I am not an expert on this. Minor comments: l 46: Which protocol?

      Is there sufficient data validation and statistical analyses of data quality?<br> Not my area of expertise. One thing that could probably be done additionally is to provide dot plots against the one or two most closely related species with chromosome-level assemblies (probably Tilapia or Medaka).

      Is the validation suitable for this type of data?<br> Yes. As far as I can judge the analysis is fine.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?<br> Yes. Genome and annotation are available, which is the most important for reuse and integration with other data sets. So as far as I can judge there is sufficient information for others.

      Any Additional Overall Comments to the Author:<br> From my viewpoint, this is a useful chromosome-level genome, so I support its publication. Beyond being a useful resource, I was however a bit disappointed by the 'scientific part' regarding bi-color body formation. While the pigmentation of the bicolor angelfish is certainly a very exciting phenotype, the analysis performed is far too superficial to give any solid insights into the phenotype. I would suggest the authors tone this down in the title, abstract and main text. It is fine to mention this as a future research direction and to state that the initial analysis performed (figs. 6 and 7) might aid these investigations, but the data do not permit further conclusions. Especially as GigaByte does not focus on analyses for biological findings, this should be completely sufficient.

      Recommendation: Major Revision

      Re-Review: I am happy with the changes made and thank the authors for addressing them. The manuscript is in my opinion acceptable for publication. Congratulations to the authors for providing a reference genome for this exciting fish species to the community.

    1. Horsegram

      Reviewer 2. Penghao Wang The authors present a paper describing a new pseudo-chromosome-level draft genome sequence of the legume plant horsegram, along with bioinformatics analyses based on the data. The presented assembly is of good quality and the bioinformatics analysis performed is sound. The resources made available by the study should prove valuable to researchers working on this plant and to the legume community as a whole. The paper is generally well written, and I personally found it quite easy to follow; a few grammatical errors can be found. The bioinformatics methodology used in the study is sound, and the software used fits the goals of the study. However, the authors need to present more details for some analysis components, e.g. the parameter sets used for the software, the software versions, the OS, etc., so that the analyses can be better reproduced. For example, in the Methods section, line 76, the Jellyfish program was used to estimate the genome size, but the parameters, version and OS used to run the software were not mentioned. Line 78: for SOAPdenovo2, apart from the k-mer size, the most important parameter, what about the rest? SSPACE 2.0 was used for scaffolding: what insert sizes? Platanus, MaSuRCA, TruSPAdes, RepeatMasker and AUGUSTUS all involve a number of parameters, and details on how they were used need to be provided, because the results can differ sharply with different parameters. Some figures appear to have been created using particular tools, and these tools need to be acknowledged and referenced. For example, was Circos used to generate the circular plot in Fig. 5? In addition, I could not find captions for all the main figures.

      Recommendation: Minor Revision

    2. Summary

      Reviewer 1. Tianzuo Wang Is the language of sufficient quality? It could be improved.

      Shirasawa et al. report a chromosome-scale draft genome sequence of horsegram and perform comparative genomic analyses. 1. If PacBio data were used, the quality of the genome could be much improved. 2. Only the genomes of P. vulgaris, V. angularis and L. japonicus among the legumes were used for phylogenetic analysis. Soybean and Medicago, as the model legume plants, should at least be added. 3. The section on whole-genome structure of horsegram should be introduced before the sections on diversity analysis of genetic resources, genes related to drought tolerance, and transcript sequencing, gene prediction and annotation, because the genome information is the foundation of the other analyses.

      Recommendation Major Revision

    1. Background

      Reviewer 2. Alun Li.

      Is the language of sufficient quality? Yes

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes

      Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code? No. Additional Comments: There is no license in the GitHub repository.

      As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code? Yes. Github can be used to report issues or seek support on the code

      Is the code executable? Yes

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Yes

      Is the documentation provided clear and user friendly? Yes

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level? Yes

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages? Yes

      Are there (ideally real world) examples demonstrating use of the software? Yes

      Is automated testing used or are there manual steps described so that the functionality of the software can be verified? Yes

      Additional Comments<br> Any Additional Overall Comments to the Author The paper describes Atria, an ultra-fast and accurate trimmer for adapter and quality trimming, and compares it to several published tools. The tool is demonstrated to work on sequencing data with competitive accuracy and efficiency compared with existing tools.

      There are concerns that should be addressed: 1. The performance comparisons listed in Table 2 show that Atria is not extremely impressive compared with existing tools with quality trimming, in terms of the percentage of properly paired reads and the number of unmapped reads. Also, it offers no more features than existing tools such as fastp, which may limit the widespread use of this software. 2. I/O can be the main bottleneck on most hard-disk drives when performing adapter trimming on compressed input/output files, so the wall time to run different tools is also a good measurement. I wonder whether there is a significant performance advantage if the runtime benchmark is measured by wall time. 3. Can the algorithm deal with different lengths of adapter sequences? It would be good to test the performance of the tools with increasing adapter sequence length. 4. L79 states that Atria is compatible with single-end data from PacBio and Nanopore platforms, but there is no corresponding data in the paper to support this statement. Besides, the limitations of the byte-based matching algorithm make it difficult to deal with PacBio and Nanopore sequences with high insertion and deletion rates. It is necessary to describe in sufficient detail how these limitations are overcome, if they have been. 5. It may be better if the algorithm were presented in pseudocode, especially in the sections "Matching and scoring" and "Decision rules". 6. L165-L168: I don't quite understand why an adapter is considered an ideal adapter when the matching score is greater than 10. Also, why is the read pair not trimmed when the matching score is less than 19? Are there any reasons for the authors to set these two parameters to 10 and 19, respectively? In addition, it is necessary for the authors to demonstrate that the program is robust enough for different lengths of adapter sequences. 7. All symbols in the paper should be clearly identified, e.g., L115 a1, L121.
8. L135, "Because the matching algorithm requires much less time, we implement four pairs of matching to utilize properties of paired-end reads thoroughly". The causation here does not hold.

      Recommendation Minor Revisions

    2. Abstract

      Reviewer 1. Xingyu Liao

      Is the language of sufficient quality? Yes

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes

      Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code? Yes

      As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code? Yes

      Is the code executable? Unable to test

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Yes

      Is the documentation provided clear and user friendly? Yes

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level? Yes

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages? No

      Are there (ideally real world) examples demonstrating use of the software? No

      Is automated testing used or are there manual steps described so that the functionality of the software can be verified? Yes

      Additional Comments

      Opinion: Author Should Prepare a Major Revision.

      In this paper, the authors propose a trimming algorithm called Atria, which matches the adapters in paired-end reads and finds possible overlapping regions with a super-fast and carefully designed byte-based matching algorithm. Furthermore, Atria implements multi-threading in both sequence processing and file compression, and supports single-end reads. The proposed algorithm has some significance in both theory and practical application. However, I still have some questions to discuss with the authors. My comments on the paper are as follows. (1) Major Comments: 1) The authors highlight the fast and accurate characteristics of the proposed trimming algorithm in the title of the manuscript. However, a large amount of the content in the manuscript and supplementary material is devoted to proving the advantages of the proposed algorithm in terms of speed, processing efficiency, and utilization of CPU and RAM. The assessment of trimming accuracy is very limited, and it seems that only general statistics are given in Table 2 of the manuscript. I personally think that the alignment rate of reads (or the number of paired-end reads) before and after trimming is not a good proof of the accuracy of a trimming algorithm. What's more, judging from the experimental results in Table 2, the Atria algorithm does not have much of an advantage in accuracy compared to other methods. As the authors state in the abstract, sequence trimming is of great significance for SNP detection and sequence assembly; I would very much like to see Atria's optimization of, and benefit to, these applications. 2) The datasets used in this study seem unrepresentative, and most of them can be trimmed within a few to ten seconds. The difference between a few seconds and a dozen seconds is one most users will not care about. To prove significant advantages of the proposed algorithm in terms of efficiency, some large-scale datasets (such as several samples sequenced in the 1000 Genomes Project) should be used.
(2) Minor Comments: 1) The Table 2 display at line 562 is incomplete.

  5. Sep 2021
    1. Abstract

      Reviewer 1. Joon-Ho Yu Thank you for the opportunity to review this manuscript. Overall, I appreciate this argument for and description of Open Humans.  Broadly, the manuscript would benefit from greater attention to writing and organization. As my comments describe below, the "ethical analysis" offered is narrowly focused and appears to serve as a justification for the resource; yet, in its current state, I think the ethical analysis either should be removed or expanded. Ideally, the manuscript would be strengthened by a deepening and broadening of ethical considerations. Note that I use P(page)C(column)L(lines) to locate my comments for the authors.

      1. Abstract P1L36-37.  I am struck by the framing of this ethical problem as the responsibility of data subjects.  I assume this is intentional and would appreciate a little more, perhaps in the introduction, as to what is entailed in this responsibility?
      2. Abstract P1L42-43. I am not sure if the framing of the ethical problem is resolved by the description of the utility of Open Humans.  While overall, I suggest deepening the ethical problems presented, another alternative is to leave it out all together.
      3. P2C2L6-9. It would help me if parties were more clearly stated.  I think you mean researchers not research and it isn't clear to me that commercial data sources have interests but rather the companies that hold these resources do, right?
      4. P2C2 Participant Involvement.  It is unclear to me what the purpose of this section is.
      5. P2C1 Data Silos. Most of the descriptive language is written in the passive voice which I understand may be the norm but in my opinion, it unintentionally highlights how interests and responsibilities are dissociated or dis-located from stakeholders.  For instance, in the section on Data Silos, it remains unclear for whom Data Silos are a problem and whose interests have created and maintained these silos.  Again, this sort of analysis might help identify or locate solutions rather than only set up a problem that Open Human's solves.  My point here is that the developers of Open Humans need not rely on a somewhat limited ethical analysis to justify its existence and argue for its utility.
      6. P2C1L44-49. While I agree this is an accurate reflection of the scope of the literature, the issues raised by "big data" research now extend far beyond the common risks relayed in a consent process.
      7. P2C1L49-51. I agree that this is an important issue but this single statement citing Barbara Evans sounds a little like a strawman.  My sense is that through the efforts of many patient-driven organizations, patient and participant-driven research has increased a great deal in the past decade or so.  Perhaps this ought to be recognized especially given that many of the authors have been critical to the development of this movement.  Also, the next section on participant involvement seems at odds with the argument so some clarification might help readers understand the nuances.
      8. P2C2L53-61.  While I totally agree and appreciate these key points to the participant-centered approach to research, in all honesty, I did not come to these conclusions based on the above exposition.  I suggest moving this up as the scaffold for the introduction and reorganize based on these points.
      9. P3C1L30-36. These are the main points I think readers need in the introduction to help us understand the need for Open Humans.  I suggest you spend more time explaining these points and characterizing the evidence of these important assertions.
      10. P3C2L46-50. Could you explain the rationale behind this feature and briefly describe if more detailed information is conveyed about the IRB approval or review/determination?
      11. P4C2L25-27. This is an important statement, at least to me, but it would be helpful to reiterate how privacy is maintained, I'm assuming because its pseudonymous?
      12. P4C2L27-30. Again, what are the simple requirements?
      13. P5C1L58-C2L59. So what are the ethical implications of this use case?  I think an important point to highlight is that privacy may be a nominal issue for members of efforts like Open Humans, as they often have a greater-than-average interest in research benefits relative to maintaining individual privacy. Further, I'm under the impression that personal privacy is less of a concern for many, or rather that our sense of what is private is changing.  Assuming I'm understanding the argument, what I'm confused about is that the ethical analysis presented in the background assumes that privacy is of central, perhaps even sole, concern.  Also, there are many other ethical issues that Open Humans both addresses, possibly in a positive way, and potentially raises as risks to members and even society.  So, I would welcome that analysis alongside this nice introduction to the platform, or I would not rest the argument for the platform on a relatively narrow ethical frame.
      14. P6C2L16-21. Do you mean the public data are being used as training sets for the algorithms?  Are there any risks of bias based on these sorts of uses?
      15. P6C1L44-45. So are there any ethical issues related to the application of OAuth2 to these particular use cases or overall?  This isn't a trick question, I have no idea but would encourage the authors to consider based on their expertise.
      16. P7C2L9-11. Agreed, but does it also make it harder for bad actors to use these data?  It would be great if the authors could help us think about this potential trade off.
      17. P7C1 Discussion. I would like the authors to consider the following in the discussion and possibly the introduction. (1) Given that most people who engage in citizen science in the biomedical research space are likely to subscribe to the value of openness and sharing of samples, data, tools, etc., I wonder if focusing on privacy as the key ethical barrier is on target and sufficient.  For instance, many of the challenges to genomic research articulated by historically vulnerable populations have to do with offensive data uses, lack of control, lack of direct benefit, differential benefit based on SES, risks to groups, etc.  Again, a critical analysis of how this resource might increase or decrease such risks involved in citizen science would contribute to the larger project of extending citizen science or patient-led research to community-led research.  Of course, I understand this might be outside the bounds of this manuscript, but that should not preclude some consideration. (2) I very much appreciate Open Humans as a tool that addresses the practical problem of bridging/linking/aggregating.  I have no problems with this argument, yet I wonder if it is somewhat naive to assume that bridging as a practical benefit does not also risk other ethical challenges.  For example, the ease of bridging to pre-selected resources blurs the line between simply linking resources and advancing particular interpretations of the data, in fact, one's own data.  If I understand Open Humans, it is a tool that automates protocols for linking and sharing, intended to facilitate citizen science and patient-led research.  The practical benefits are clear. But what are the risks associated with more automated linking and sharing?
      18. P7C2 Enabling individual-centric research and citizen science. This section is very helpful and references a number of mechanisms that begin to address, at least on an individual level, issues such as "to what uses", "control", "governance", etc.  I would love to either see this description expanded and moved up into the initial description of the resource (maybe before or around P2C2L57) and or these functional benefits better incorporated and explicated in the use cases.
      19. P8C1L13-16. It is unclear to me how it is "an ethical way", especially as it isn't clear to me what an "unethical way" would entail.   I think some pieces are presented, but this argument could be much stronger and clearer.  I get that the benefits are assumed here to some extent, and I've been in the same place when engaging in resource development, but perhaps a greater consideration of potential benefits and harms might help balance the focus on privacy and individual control.  Generally, when we conduct ethical analysis we consider autonomy (where privacy sits), risks (as potential harms as well as, increasingly, benefits), and justice.  Notably, others might argue for other principles and values.  While such a comprehensive analysis isn't the focus of this manuscript, incorporating the insights of such an analysis would, in my opinion, strengthen the argument for Open Humans and signal/evidence robust consideration by its designers and authors.
    2. Background

      Reviewer 2. Birgit Wouters.

      In this paper, the authors have presented an innovative solution to the complex and multi-faceted problem of sharing personal (health) data. Open Humans, a community-based platform, serves multiple aims: (1) to be ethically justifiable: a. by focusing upon granular, individual consent for each single project, thereby avoiding the issue of compatible purposes for secondary/tertiary/... processing; b. by putting individuals in control of their personal dataset; and c. by involving them in the governance of the ecosystem; (2) to enable both academic and citizen-led research; and (3) to break open existing data silos and allow for the merging of datasets. Serving these aims simultaneously is undoubtedly ambitious. Yet the authors have demonstrated how Open Humans is designed to do just that. The community-based platform has clearly been carefully designed, and the presentation of the design and the use cases is clear, well-written and easy to follow. Whilst Open Humans is an interesting and promising project, my comments center around the ethical justifiability of this community-based platform. Further clarification and/or elaboration on these comments is strongly recommended. One important goal of Open Humans is for research to be driven by the individuals the data come from, by putting them in control of their data. The level of control is described as 'full control'. In addition, putting the participant in control of their data is regarded as important given the more sensitive context of precision medicine. Under "Data Silos", the authors also mention that, next to other legislation, the General Data Protection Regulation is applicable and that the right of data portability has the potential to break open these silos. My main critique is that the article insufficiently takes into account the particularities of the General Data Protection Regulation.

      WHAT CONSTITUTES CONTROL?
      Firstly, under the General Data Protection Regulation, the individual has the following rights: the right to be informed, the right of access, the right to rectification, the right to be forgotten, the right to restriction of processing, the right to data portability, the right to object and, albeit less relevant in this context, rights in relation to automated decision-making. Yet, in relation to scientific research, most Member States of the European Union allow for the right of access, the right to rectification, and the right to restriction of processing to be denied. The article very briefly mentions data access, within the context of human subjects research, as being recommended but not legally required. However, it makes no mention of the other two deniable rights (the right to rectification and the right to restrict processing). This leads to the first main question: what exactly constitutes control? How does Open Humans define control? The article mentions and describes a granular consent and privacy model. However, consent, while important, is merely a legal basis for processing. How does Open Humans guarantee the other individual rights granted by the General Data Protection Regulation? The right to information is briefly described on page 7, and so is the right of data portability, but if full control is the desirable route, it means guaranteeing all rights granted. However, in the context of reproducibility of scientific research, granting all rights does not seem feasible. In particular, the right to rectification and the right to restrict processing seem problematic. Further clarification/elaboration on this issue is required. Is full control the route Open Humans wants to take, or is Open Humans implementing a limited form of control for the individual? Apart from granular consent, what other forms of control does Open Humans offer?

      GRANULAR CONSENT IS DIFFERENT FROM SPECIFIC CONSENT
      The GDPR requires consent to be freely given, specific, informed and unambiguous (see article 7 and recital 32). Granular consent is needed when one service is involved with multiple processing operations for multiple purposes. In such a case, consent is required for every purpose of processing. This is referred to as granular consent. Whilst closely related, granular consent is therefore different from specific consent. However, in the context of Open Humans, it is doubtful that a situation will arise where one research project will process data for more than one purpose, and thus require granular consent. Research projects work on the basis of a specific research question and/or purpose.

      RIGHT TO DATA PORTABILITY IS LIMITED TO DATA PROVIDED BY THE INDIVIDUAL
      The right to data portability is regarded as having the potential to boost the adoption of a system where individuals can recollect and integrate their personal data from different sources, 'as it guarantees individuals in the European Union a right to export their personal data in electronic and other useful formats'. However, Article 20 of the GDPR limits the right to data portability to the personal data that the data subject himself/herself has provided to the controller. Data provided by the data controller do not fall under the scope of the right to data portability. The argument that the right to data portability can lead to the breaking up of the different data silos is therefore less convincing.

    1. This article is a preprint and has not been certified by peer review. Authors: John M. Sutton, Joshua D. Millwood, A. Case McCormack and Janna L. Fierst (Department of Biological Sciences, The University of Alabama, Tuscaloosa, AL 35487-0344). Correspondence: janna.l.fierst@ua.edu

      This work has been peer reviewed in GigaByte (https://doi.org/10.46471/gigabyte.27), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Zhao Chen
      The authors should clarify why only Canu and Flye were selected instead of other long-read assemblers such as Raven, Redbean, Shasta, and Miniasm. Rationales should be given for why these two assemblers were selected. The same applies to MaSuRCA. It looks like you used MaSuRCA for hybrid assembly; Unicycler also contains a commonly used hybrid assembly pipeline. Therefore, you should also explain why MaSuRCA was selected for your study.

      A flow chart with all bioinformatics tools included should be provided to show more clearly how this entire study was carried out, including assembly, error correction, and analysis after assembly.

      More information about the quality of long reads should be provided, such as Phred quality scores, percentage of reads with Q30 or above, and average read lengths. QUAST should suffice for these quality analyses.

      Only testing simulated reads is not sufficient for making a solid conclusion, since simulated reads cannot be treated as equal to real reads or reflect basecalling errors in real reads. Since real reads are readily available on NCBI, real reads should also be tested. As your title didn't mention that this study was based solely on simulated reads, and your objective was to optimize the bioinformatic pipeline for processing Oxford Nanopore long reads, the experiments should be performed by including all conditions. Accordingly, real reads should also be tested, which could significantly improve the scientific quality of this study.

      Lines 12-13: This may not be true, since many studies have been published on how to assemble and error-correct Oxford Nanopore long reads to produce accurate genomes. The authors should describe why the present study is novel and what new findings were reported.

      Recommendation: Major Revisions

      Reviewer 2. Shanlin Liu
      Genome de novo assembly based on third-generation sequencing (ONT in the current work) has been widely applied to plenty of organisms, including bacteria, plants and animals with various genome sizes, e.g. the two recently published lungfish genomes (genome size > 30 Gb) in Nature and Cell, and genomes of a broad range of species published in GigaScience, Scientific Data, Molecular Ecology Resources, et al. It is pretty easy to find the analysis pipelines or datasets that were used to obtain a high-quality genome assembly in those published works. The authors generated multiple genome assemblies for four model species using different simulated datasets with varied sequencing depths and different assembly tools, and tried to provide useful guidance for those who are new to genome assembly. However, I am afraid that the current study contains some limitations in the results and conclusions that may mislead readers, and I do recommend the authors restructure the manuscript and address those issues before its publication.

      First of all, a routine practice of genome assembly with long reads (either ONT or PacBio) includes a polishing step based on the long reads themselves, using tools like Nanopolish, Medaka, Racon, et al. The authors skipped this step in all of their analyses and directly evaluated the assembly errors based on the outputs generated from different combinations of datasets and software. This has little practical value whatever the results showed.

      Secondly, the four model species included in the current work can hardly represent a broad range of organisms: all have a genome size < 200 Mb and a low level of repetitive elements (< 30%). Hence, the analysis results from the current work offer scant guidance to those who work on organisms like plants, fishes, insects, mammals et al. For example, computing resources become the first hurdle for genome assembly when working on > 100X ONT reads for species with large genome sizes, even if you can afford the sequencing. So, researchers would generate less data or prefer assemblers like WTDBG, NextDenovo, Falcon to obtain their genome assembly. In addition, the authors deem Caenorhabditis species to have highly heterozygous genomes (0.7% according to their calculation), which is also open to question. Genomes of multitudinous insects and plants have a much higher level of heterozygosity.

      What's more, the authors may want to pay attention to the news regarding the Sequel II sequencing platform recently released by PacBio. As far as I know, it can provide inexpensive long-read sequencing thanks to its huge improvement in sequencing throughput. It also has a new release of a library preparation kit that can work on low amounts of input DNA. If so, what you stated in the introduction section may be incorrect.

      Besides the major issues mentioned above, there are some other minor ones, listed as follows:
      Line 89. The authors may want to provide common names of those model species to improve readability of the manuscript.
      Line 119. Genome references and ONT reads were derived from different individuals or strains, and there are very low coverage ONT reads for E. coli. I am not sure whether those factors will influence the quality of the simulation or not. The authors may add a caution to clarify these concerns.

      Line 24. A combination of experimental techniques? It is better to specify what experimental techniques.
      Line 128. Incorrect word format, and C. latens missed.
      Line 141. How do you define the best performance - the most contiguous assembly?
      Line 137. When you say "failed to produce an assembly", does it mean that the software failed to generate outputs, or produced unexpected assembly results?
      Line 287. Supplement the BUSCO value of the reference TAIR10.
      Line 287. What do you mean by "combined approach"? Do you mean the method that corrects reads using Canu and assembles them using Flye?
      Lines 233-241. The "corrected" and "selected" datasets used in the Nematoda test were not applied to other organisms.
      Line 241. Canu correction could truncate some low-quality reads or cut long reads into multiple pieces for suspected chimeric reads. I don't think you can reach the conclusion that read length influences assembly quality using the current results.
      Line 242. Please rephrase this sentence and put Figure 5 and reference #36 in better positions to avoid misunderstanding.
      Lines 337-341. This duplicates the content of lines 308-312, and the two conflict with each other.
      Line 355. All the tested organisms have genome sizes < 200 Mb; please specify this limitation instead of saying a broad range of organisms.
      Line 368. "Low coverage" may mislead readers; the authors cannot reach such a conclusion based on merely a single test.
      Line 461. Which model was used - high accuracy or flip-flop?
      Table 1. The header is too long; some of its content could be moved to table notes.

      Recommendation: Major Revisions

    1. This article is a preprint and has not been certified by peer review. Authors: Sherry Miller (Division of Biology, Kansas State University, Manhattan, KS 66506; Allen County Community College, Burlingame, KS 66413), Teresa D. Shippy (Kansas State University), Prashant S Hosmani (Boyce Thompson Institute, Ithaca, NY 14853), Mirella Flores-Gonzalez (Boyce Thompson Institute), Wayne B Hunter (USDA-ARS, U.S. Horticultural Research Laboratory, Fort Pierce, FL 34945), Susan J Brown (Kansas State University), Tom D'Elia (Indian River State College, Fort Pierce, FL 34981) and Surya Saha (Boyce Thompson Institute; University of Arizona, Tucson, AZ 85721). Correspondence: ss2489@cornell.edu

      This work has been peer reviewed in GigaByte (https://doi.org/10.46471/gigabyte.26), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Hailin Liu. It seems there is little sound biological value in this manuscript, and more validation or comparative studies are suggested to mine more meaningful conclusions.

      Reviewer 2. Mary Ann Tuli Is the language of sufficient quality?<br> Yes. The manuscript reads very well.

      Are all data available and do they match the descriptions in the paper?<br> No. Additional Comments: 1) Line 206. "Multiple alignments were performed with MUSCLE or MEGA7" (figure 1). We need the output of MUSCLE (FASTA). We need the output of MEGA7 (FASTA).

      2) I note that MEGA7 has been used. I wonder why the newer release (MEGAX, March '21) was not used. Furthermore, the annotation protocol (dx.doi.org/10.17504/protocols.io.bniimcce) suggests using Mega7 or MegaX.

      3) Line 207. "phylogenetic analysis was done in MEGA7 or MEGA X" (figure 2). We need the files underlying the phylogenetic tree (newick) (figure 2).

      Are the data and metadata consistent with relevant minimum information or reporting standards? Yes. Nomenclature standards have been met. All cited INSDC accession numbers are publicly available.

      Is the data acquisition clear, complete and methodologically sound?<br> Yes. The curation workflow used for community annotation is available via protocols.io; nonetheless, the manuscript includes a comprehensive summary, which is appropriate.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?<br> No. See "Are all data available and do they match the descriptions in the paper?" above. Once the additional files are made available I believe reproduction will be possible.

      Is there sufficient data validation and statistical analyses of data quality? Yes

      Is the validation suitable for this type of data? Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Any Additional Overall Comments to the Author<br> Some of my comments/recommendations are pertinent to the other D. citri manuscripts currently under review.

    1. Now published in Gigabyte doi: 10.46471/gigabyte.25

      This work has been peer reviewed in GigaByte (https://doi.org/10.46471/gigabyte.25), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Xingtan Zhang and Yingying Gao

      The manuscript entitled “Characterization of chitin deacetylase genes in the Diaphorina citri genome” has an interesting subject. However, the general organization and content of the manuscript are not well developed and need extensive revision. It seems to me that the introduction does not offer a good amount of useful information or background knowledge for its reader. Some sentences are confusing (as an example, see line 30).

      In the "Materials and Methods" section, lines 137-139: for the software, which one was used, or both, given that there is only one tree plot? Why were these species chosen?

      Further experiments targeting these genes need to be conducted. The RNA-seq data should be confirmed by other experimental methods to support your results or conclusions. The results and conclusion are not very informative, and a discussion is missing. The manuscript feels more like an informal work report than a research article.

      Figures and tables also need extensive revision. The figures and tables are important, but their titles could be more descriptive and clear, and the notes should be separated from the titles to make them more readable. In particular, as a “note”, useful information in the figure needs explanation rather than a repetition of the methods. Make sure the formats of all tables are consistent.

      The references could be improved (increase the number of references). There are many formatting mistakes; please check. It is fundamental to understand the basic rules of referencing; please review all of the references accordingly.

      Reviewer 2. Mary Ann Tuli Is the language of sufficient quality?<br> Yes. 1) Line 27. "Genomic" should be "genomic"

      Are all data available and do they match the descriptions in the paper?<br> No. Additional Comments: 1) Line 137. "Multiple alignments were performed using MUSCLE " We need the output of MUSCLE (FASTA).

      2) Line 137. "phylogenetic trees were constructed (figures 1 and 3) in MEGA7 or MEGAX." We need the files underlying the phylogenetic tree (newick). Please indicate which version of MEGA was used for each tree.

      3) Line 139. "Expression data from CGEN was visualized using the pheatmap package of R or Microscoft Excel" Please can you provide a file of the raw data used to produce the heatmap (figures 2a ) and the TMP graph (figure 2b).

      Are the data and metadata consistent with relevant minimum information or reporting standards? Yes. Nomenclature standards have been met. All cited INSDC accession numbers are publicly available.

      Are the data and metadata consistent with relevant minimum information or reporting standards? Yes. Nomenclature standards have been met. All cited INSDC accession numbers are publicly available.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?<br> No. See "Are all data available and do they match the descriptions in the paper?" above. Once the additional files are made available I believe reproduction will be possible.

      Is there sufficient data validation and statistical analyses of data quality? Yes

      Is the validation suitable for this type of data? Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Any Additional Overall Comments to the Author:<br> 1) Citation [23] MUSCLE. https://www.ebi.ac.uk/Tools/msa/muscle/.

      • the website suggests users of MUSCLE cite DOI:10.1093/nar/gkz268

      Some of my comments/recommendations are pertinent to the other D. citri manuscripts currently under review.

    1. This article is a preprint and has not been certified by peer review. Authors: Sherry Miller (Division of Biology, Kansas State University, Manhattan, KS 66506; Allen County Community College, Burlingame, KS 66413), Teresa D. Shippy (Kansas State University), Blessy Tamayo (Indian River State College, Fort Pierce, FL 34981), Prashant S Hosmani (Boyce Thompson Institute, Ithaca, NY 14853), Mirella Flores-Gonzalez (Boyce Thompson Institute), Lukas A Mueller (Boyce Thompson Institute), Wayne B Hunter (USDA-ARS, U.S. Horticultural Research Laboratory, Fort Pierce, FL 34945), Susan J Brown (Kansas State University), Tom D'Elia (Indian River State College) and Surya Saha (Boyce Thompson Institute; University of Arizona, Tucson, AZ 85721). Correspondence: ss2489@cornell.edu

      This work has been peer reviewed in GigaByte (https://doi.org/10.46471/gigabyte.23), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Hailin Liu Is there sufficient data validation and statistical analyses of data quality?<br> No. The validation work is not revealed in the manuscript, such as the qRT-PCR experiment.

      Is the validation suitable for this type of data?<br> No. More validation work should be added, rather than relying solely on RNA-seq data from the public database.

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Any Additional Overall Comments to the Author:<br> Formatting errors should be corrected, including the tables and the alignment method of words. The introduction and methods seemed to be too simple for readers. More biological meanings should be explained in the manuscript. The basic assessment of the utilized genome should be added.

      Recommendation: Major Revision

      Reviewer 2. Mary Ann Tuli Is the language of sufficient quality?<br> Yes. The manuscript reads very well.

      Are all data available and do they match the descriptions in the paper?<br> No. Additional Comments:<br> 1) Line 149. "Multiple alignments of the predicted D. citri proteins and their insect homologs were performed using MUSCLE." We need the output of MUSCLE (FASTA).

      2) Line 151. "Phylogenetic trees were constructed (figures 1 and 4) using full-length protein sequences in MEGA7 or MEGA X." We need the files underlying the phylogenetic tree (newick). Please indicate which version of MEGA was used for each tree.

      3) Line 152. "Gene expression levels were obtained from the Citrus greening Expression Network and visualized using Excel and the pheatmap package in R." Please can you provide a file of the raw data used to produce the heatmap (figure 2) and the expression levels of UAP2 in male and female tissues (figure 5).

      Are the data and metadata consistent with relevant minimum information or reporting standards? Yes. Nomenclature standards have been met.

      All cited INSDC accession numbers are publicly available.

      Is the data acquisition clear, complete and methodologically sound? Yes. The curation workflow used for community annotation is available via protocols.io; nonetheless, the manuscript includes a comprehensive summary, which is appropriate.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?<br> No. See "Are all data available and do they match the descriptions in the paper?" above. Once the additional files are made available I believe reproduction will be possible.

      Is there sufficient data validation and statistical analyses of data quality? Yes

      Is the validation suitable for this type of data? Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Any Additional Overall Comments to the Author:<br> 1) Line 147. Apollo version should be included in the other D citri manuscripts.

      2) Citation [26] MUSCLE. https://www.ebi.ac.uk/Tools/msa/muscle/.

      • the website suggests users of MUSCLE cite DOI:10.1093/nar/gkz268

      Some of my comments/recommendations are pertinent to the other D. citri manuscripts currently under review.

    1. This article is a preprint and has not been certified by peer review. Authors: Chad Vosburg (Indian River State College, Fort Pierce, FL 34981; Department of Plant Pathology and Environmental Microbiology, The Pennsylvania State University, University Park, PA 16802), Max Reynolds (Indian River State College), Rita Noel (Indian River State College), Teresa Shippy (KSU Bioinformatics Center, Division of Biology, Kansas State University, Manhattan, KS), Prashant S Hosmani (Boyce Thompson Institute, Ithaca, NY 14853), Mirella Flores-Gonzalez (Boyce Thompson Institute), Lukas A Mueller (Boyce Thompson Institute), Wayne B Hunter (USDA-ARS, U.S. Horticultural Research Laboratory, Fort Pierce, FL 34945), Susan J Brown (KSU Bioinformatics Center, Kansas State University), Tom D'Elia (Indian River State College) and Surya Saha (Boyce Thompson Institute; University of Arizona, Tucson, AZ 85721). Correspondence: ss2489@cornell.edu

      This work has been peer reviewed in GigaByte (https://doi.org/10.46471/gigabyte.21), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Xingtan Zhang and Dongna Ma

      The manuscript by Vosburg et al. systematically analyzed the characteristics of the Wnt signaling genes in Diaphorina citri, focusing on their evolutionary history, expression patterns and potential functions, and performed manual annotation of the Wnt signaling pathway. Indeed, this work adds an important resource for the study of the evolutionary history of D. citri and of Wnt signaling in this important hemipteran vector. The writing is acceptable. Even so, I still have some suggestions which may improve this manuscript.

      1. In the methods, the authors have described the process of identifying Wnt genes, but the abstract describes it as curation. I am confused as to whether the Wnt signaling genes in D. citri were identified by the authors, or whether the authors just further analyzed results already identified by others.
      2. The paper only covers identification of the Wnt genes, evolutionary analysis, and then expression analysis using RNA-seq. It is recommended to also look at the chromosomal localization and mode of origin (e.g., tandem repeats).
      3. The Wnt signaling genes related to the hemipteran vector studied by the authors can be further verified by qPCR and then compared with the expression and function of other published insect-related genes for discussion.

      Major Revision.

      Reviewer 2. Mary Ann Tuli Is the language of sufficient quality?<br> Yes. The manuscript reads very well.

      Are all data available and do they match the descriptions in the paper?<br> No. Additional Comments:<br> 1) Line 176. "High scoring MCOT models were then searched on the NCBI protein database...." We need the list Wnt pathway genes with high scoring MCOT models.

      2) Line 178. "The high scoring MCOT models that had promising NCBI search results were used to search the D. citri assembled genome." We need the list of high scoring MCOT models which had promising NCBI search results..

      3) Line 179. "Genome regions of high sequence identity to the query sequence were investigated within JBrowse" We need the list of models with high sequence identity with the assembled genome.

      4) Line 184. "MUSCLE multiple sequence alignments of the D. citri gene model sequences and orthologous sequences were created through MEGA7" We need the output of MUSCLE (FASTA). We need the files underlying the phylogenetic tree (newick).

      5) I note that MEGA7 has been used. I wonder why the newer release (MEGAX, March '21) was not used. Furthermore, the annotation protocol (dx.doi.org/10.17504/protocols.io.bniimcce) suggests using Mega7 or MegaX.

      Instructions on how to upload these files is given under "Any Additional Overall Comments to the Author".

      Are the data and metadata consistent with relevant minimum information or reporting standards?<br> Yes. Nomenclature standards have been met. All cited INSDC accession numbers are publicly available.

      Is the data acquisition clear, complete and methodologically sound?<br> Yes. The curation workflow used for community annotation is available via protocols.io; nonetheless, the manuscript includes a comprehensive summary, which is appropriate.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. See "Are all data available and do they match the descriptions in the paper?" above. Once the additional files are made available I believe reproduction will be possible.

      Is there sufficient data validation and statistical analyses of data quality? Yes

      Is the validation suitable for this type of data? Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Any Additional Overall Comments to the Author:<br> Some of my comments/recommendations are pertinent to the other D. citri manuscripts currently under review.

      Minor Revision.

    1. This article is a preprint and has not been certified by peer review.

      This work has been peer reviewed in GigaByte (https://doi.org/10.46471/gigabyte.20), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Feng Cheng. Crissy and co-authors annotated yellow genes in the genome of Diaphorina citri, the vector of the Huanglongbing disease in citrus plants. The results are useful for closely related areas, and here I have some comments for the authors to improve the manuscript.

      1. The sections of introduction and background can be merged into one introduction section.

      2. Many sentences in the results section can be moved to the methods section.

      3. The methods section should be rewritten and re-organized as each analysis per paragraph.

      4. Some domain analysis and figures may be helpful for illustrating the evolution of important yellow genes in different insect species.

      Reviewer 2. Mary Ann Tuli Is the language of sufficient quality?<br> Yes. Line 18. 'in-planta' should be 'in planta' (in italics).

      Are all data available and do they match the descriptions in the paper?<br> No. Additional Comments: 1) Line 224. "The MCOT protein sequences were used to search the D. citri genomes." We need the list of MCOT protein sequences that were used.

      2) Line 228. "A neighbor-joining phylogenetic tree of D. citri yellow protein sequences along with was created in MEGA version 7 using the MUSCLE multiple sequence alignment" a) Along with what? There are some words missing. b) We need the output of MUSCLE (FASTA). c) We need the files underlying the phylogenetic tree (newick).

      3) I note that MEGA7 has been used. I wonder why the newer release (MEGAX, March '21) was not used. Furthermore, the annotation protocol (dx.doi.org/10.17504/protocols.io.bniimcce) suggests using Mega7 or MegaX.

      4) Line 233. "Comparative expression levels of yellow proteins throughout different life stages (egg, nymph, and adult) in Candidatus Liberibacter asiaticus (Clas) exposed vs. healthy D. citri insects was determined using RNA-seq data and the Citrus Greening Expression Network (http://cgen.citrusgreening.org)." Results are presented in Fig 3(a) and Fig 3(b). We need the raw data underlying these figures.

      Are the data and metadata consistent with relevant minimum information or reporting standards? Yes. Nomenclature standards have been met. All cited INSDC accession numbers are publicly available.

      Is the data acquisition clear, complete and methodologically sound? Yes. The curation workflow used for community annotation is available via protocols.io; nonetheless, the manuscript includes a comprehensive summary, which is appropriate.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. See "Are all data available and do they match the descriptions in the paper?" above. Once the additional files are made available, I believe reproduction will be possible.

      Is there sufficient data validation and statistical analyses of data quality? Yes

      Is the validation suitable for this type of data? Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Any Additional Overall Comments to the Author: Citation [39] is not complete. It should be the MCOT protein database.

      Some of my comments/recommendations are pertinent to the other D. citri manuscripts currently under review.

    1. Abstract

      This work has been peer reviewed in GigaByte, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Aaron Shafer. Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. I would include all flags for assemblies even if default; unclear how the 10x + Illumina data were integrated (if at all) - see comments below.

      Is there sufficient data validation and statistical analyses of data quality? Yes. I suppose BUSCO and gene number are a form of validation.

      Is there sufficient information for others to reuse this dataset or integrate it with other data? No. See comment below; while the short-read data are great, I would likely reassemble the genomic resource for a variety of reasons outlined in the Additional Comments.

      Any Additional Overall Comments to the Author: The paper is well written, and I have no comments about the content - well done here. My main concern lies with the genome resources - and in this case I would likely use the raw data rather than the assemblies provided. I offer my rationale and suggestions:

      My lab was heavily pushed by a colleague towards the use of Meraculous in our short-read assembly of mammal genomes (https://jgi.doe.gov/data-and-tools/meraculous/); this is because it's really designed for short-read assemblies of big genomes (i.e. no addition of mate-pair) AND it performs very well in the Assemblathon metrics (https://academic.oup.com/gigascience/article/2/1/2047-217X-2-10/2656129) - notably in Figures 16-18 you start to see clear differences between Meraculous and, say, SOAPdenovo. Thus, for just the Illumina data, I would very much like to see a more appropriate assembler explored, as stats like N50 and no. of scaffolds will likely improve considerably with the appropriate methods.

      Likewise, it’s very unclear in the methods how M. r. arvicoloides was assembled: I see SUPERNOVA for the 10X data (great), and probably SOAPdenovo for the Illumina data (see above). But how were they combined? This sequencing strategy is really designed for a hybrid assembly (see for example DBG2OLC, https://github.com/yechengxi/DBG2OLC); this is appropriate for 10X data and really does work! But there are others.

      Note that M. agrestis, which has an identical sequencing strategy to M. r. arvicoloides, has only ~3% of the total scaffolds - follow whatever they did! And I will say, while the authors state their genome is on par with other Microtus assemblies, and this appears true from Table 3, only M. agrestis currently has an assembly that I think meets current standards. The level of fragmentation and low BUSCO scores really support re-visiting the assembly suggestions, as I think the current .fasta will be of limited utility in a population or comparative genomics study.

      The gene number is pretty high for a mammal and I worry that's due to fragmentation. It would be reasonable to only annotate scaffolds >10 Kb or 50 Kb, but then there's not much of a genome left. Ideally the bulk of your genome (>>90%) would fall on these scaffolds. There is really no sense annotating your small fragments (have you tested for contamination? Note NCBI will do this before allowing the genome to be deposited, so I suggest it).

      You also align your data to a mitochondrial genome; this is different from assembling it. You could assemble it (e.g. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1927-y), and it might be interesting to see if there are any differences.

      I wish I could be more positive; an assembly with Meraculous would take a week or so, and so would the hybrid approach, but it would be worth it based on my experience with these data.

      Recommendation: Major Revision

      Reviewer 2. Joana Damas.

      Any Additional Overall Comments to the Author: The genomes presented in this work will be extremely valuable tools for Microtus-related research. The manuscript is very clear and easy to follow. I have, however, a couple of comments that I hope will further improve it.

      (1) Line 123: I believe more details on the measures used for the selection of the best M. r. macropus assembly are needed. Even though the contiguity of the Discovar genome assembly is higher than the ones generated with SOAPdenovo, the BUSCO score is relatively low (54.5% versus 84% in M. r. arvicoloides, e.g.). Were the BUSCO scores for the other assemblies even lower? Is the Discovar assembly size closer to the estimated genome size?

      (2) Line 131/251: Was there any genome structure verification step for the M. montanus genome assembly? For instance, what percentage of the Illumina reads could be mapped back to the finished genome assembly?

      (3) Line 131/251: Was there a reason not to use a published reference-guided assembly method (e.g. RaGOO and those listed therein) for the assembly of the M. montanus genome? These could maybe further improve the assembly or help identify misassemblies.

      (4) Line 180: the large difference between BUSCO scores for the two M. richardsoni subspecies makes me believe that the completeness of the genomes is quite different, that the fraction of the genome within repeats might be underrepresented in M. r. macropus, and that the subspecies values might be closer than noted here. It is, however, difficult to discern phylogenetic relatedness from Fig. 1 for the other species, for non-experts such as myself. It would be helpful to have a phylogeny next to the graph showing species relationships.

      (5) Please verify Tables 1 and 2. The statistics presented for M. r. macropus do not match for N50 and longest scaffold size.

      Recommendation: Minor Revision

    1. This article is a preprint and has not been certified by peer review. Jaclyn Smith (University of Oxford; correspondence: jaclyn.smith@cs.ox.ac.uk), Yao Shi (University of Oxford), Michael Benedikt (University of Oxford), and Milos Nikolic (University of Edinburgh).

      This work has been peer reviewed in GigaScience, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: JianJiong Gao

      In this manuscript, the authors introduced a tool named TraNCE for distributed processing and multimodal data analysis. While the topic and tool are interesting, the writing can be improved. The current manuscript reads more like a technical manual than a scientific paper.

      For example, in the background, the discussion on data modeling in the contexts of multi-omics analysis and distributed systems is extensive, but the writing could be better organized. The examples are helpful, but they are very technical and can be hard to follow. It would be good if the main challenges could be summarized at a high level. It might also be useful to have an example analysis use case to lead the technical discussion on data modeling.

      It is also unclear who the targeted users of the tool are and why distributed computing is needed. For example, in applications 1 & 2, it is unclear why distributed computing is necessary.

      Reviewer 2. Umberto Ferraro Petrillo. First review:

      The authors propose a new framework, called TraNCE, for automating the design of distributed analysis pipelines over complex biomedical data types. They focus on the problem of unrolling references between different datasets (which can be very large), assuming that these datasets contain complex data types consisting of structured objects containing collections of other objects. By using TraNCE, it is possible to formulate queries over collections of nested data using a very high-level declarative language. These queries are then translated by TraNCE into Apache Spark applications able to implement them in an efficient and scalable way. Apart from a quick description of the TraNCE framework and of the declarative language it supports, the paper also includes a vast collection of examples of multi-omics analyses conducted using TraNCE on real-world data. I found the contribution proposed by this paper to be very timely. Indeed, there is a flourishing of public multi-omics databases, but their huge volumes make their analysis difficult and very expensive if not approached with the right methodologies. Distributed analysis frameworks like Spark can be of help, but they are often not easy to master, especially for those without deep distributed programming skills. So, TraNCE looks like a much needed contribution on this topic. However, I have some remarks. The high-level querying language supported by TraNCE is not original because, as far as I understand, it was presented in a previous paper [1] (which was written by almost the same authors and is correctly referenced in this submission). Even the TraNCE framework is not completely original, because its name appears as the name of the project containing the code presented in [1]. Finally, at least one of the experiments presented in [1] seems to have been run on the same Hadoop installation used for the experiments presented in the current submission, and has involved the same datasets from the International Cancer Genome Consortium. So, I am a bit confused about what is original in this new submission and what has been borrowed from [1]. My advice is to definitely clarify this point.

      Another issue that I think should be addressed concerns the scalability of the proposed framework. The authors state that the framework supports scalable processing of complex datatypes; however, no evidence is provided for this claim. The several different experiments that are reported seem to focus more on the expressiveness of the proposed language, while no experiment is provided on the scalability of the generated code when run on a computing architecture of increasing size. I think we may agree on the fact that using Spark does not mean that your code is scalable, nor do I think it is enough to say that the scalability of TraNCE has been proved in [1]. So, I would suggest elaborating on this as well. To be honest, I am a bit skeptical about the practical performance of the standard compilation route. I think that when applied to very large datasets it is likely to return huge RDDs that could require very long processing times. Instead, the shredded compilation route looks much cleverer to me. Could you elaborate further on this difference, especially in light of the results of your experiments? I also disagree with your decision not to describe how data skewness is dealt with in your framework. It is indeed one of the main causes of bad performance in many distributed applications, so it would be interesting to know how you managed this problem in your particular case. On the bright side, I really appreciated the flexibility of the proposed framework, as witnessed by the vast amount of examples provided, as well as its positive implications for the analysis of multi-omics databases.

      Finally, the English of the manuscript is very good and I have not been able to find any typos so far.

      [1] Jaclyn Smith, Michael Benedikt, Milos Nikolic, and Amir Shaikhha. 2020. Scalable querying of nested data. Proc. VLDB Endow. 14, 3 (November 2020), 445-457.

      Re-review: I appreciated the robust revision done by the authors and think the paper is now ready to be published.

  6. Aug 2021
    1. ABSTRACT

      This work has been peer reviewed in GigaByte (https://doi.org/10.46471/gigabyte.16), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Inge Seim. Is the language of sufficient quality? No. The authors need to polish their English further. This is particularly obvious in the Abstract and is likely to result in an unwarranted lower readership of the work.

      Are all data available and do they match the descriptions in the paper? Yes. I want to commend the authors for sharing data and associated code.

      Is there sufficient data validation and statistical analyses of data quality? Not my area of expertise.

      Any Additional Overall Comments to the Author: • R2 should be R^2 (that is, please superscript the '2'). • The sentence 'Further comparison between sequencing platforms would be useful for for exploration using as similar amplification conditions as possible. This data being provided as one such benchmark' at the end of Results is vague and needs to be rewritten. • You need to more clearly state that you do not recommend combining MGI and Illumina data sets for metabarcoding -- unlike e.g. BGISEQ-500 and Illumina RNA-seq/short-insert WGS data, which can be readily combined.

      Recommendation: Minor Revision

      Reviewer 2. Petr Baldrian. Are all data available and do they match the descriptions in the paper? No. I was not able to locate the items listed as references (26) and (27). Due to this, I was not able to fully evaluate the paper.

      Are the data and metadata consistent with relevant minimum information or reporting standards? No. I was not able to locate the data; see above.

      Is the data acquisition clear, complete and methodologically sound? No. More details on sampling (mode of sampling, area sampled, depth sampled, sample size, sample handling) are missing. Information on the number of repetitive extractions of DNA and the size of the sample used for extraction is missing. Protocols for amplification and barcoding are referenced as (27), but I was not able to locate this reference. These details have to be provided in the text for both types of sequencers.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes. For fungal ITS, the ITS region should be extracted before annotation.

      Is there sufficient data validation and statistical analyses of data quality? No. The authors do not report how they deal with sequences of fungi that produce amplicons longer than 350 bases, which cannot be paired-end joined in the 2x200 base runs. Even the MiSeq 2x250 runs miss some fungal taxa (though not very many), and here the situation is still worse. For the length distribution of fungal ITS, please consult the UNITE database.

      Is the validation suitable for this type of data? No. There should be additional validations, including the analysis of those OTUs that are abundant in one setup but missing in another (if any).

      Is there sufficient information for others to reuse this dataset or integrate it with other data? No. The metadata, supposedly in reference (26), are impossible to locate.

      Any Additional Overall Comments to the Author: I believe that this is a very good attempt to test the novel platform with fungal metabarcoding. If all required information is provided, I believe that this can be both an interesting paper and a valuable dataset.

      Recommendation: Reject (Unsound or Unusable)

      Reviewer 2. Re-review. I have now carefully read the revised version of this manuscript and I am happy with the changes that the authors implemented in response to my comments and the comments of the other reviewer. The paper is now much clearer, especially in the methodological section, and the limitations of the use of the novel sequencing platforms/formats are sufficiently discussed.

      Minor comments that should be made in the present paper:

      L58: change "bacteria" to "bacterial"

      L65-66: the last part of this long sentence is difficult to comprehend and should be rephrased. I suggest dividing the long sentence into two.

      L68-69: change "produces" to "produced"

      L84: delete "in"

      L98: please explain the abbreviation "ONT", likely "Oxford Nanopore Technologies"

      L162: the details of the amplification methods should be expanded, at least stating the primer pairs (names and sequences) used and the targeted molecular markers; from the text it appears as if ITS2 was the marker selected, yet lines 361 and 366 discuss length differences in ITS1.

      L246: replace "common fungi several species" with "common fungal species"

      L248-251: the misclassification of fungal taxa was not due to bad performance of the sequencing platform; it was because of the low variability of the ITS2 marker. I suggest changing the text to state that genus-level assignment was reached for these taxa since multiple species had the same ITS2 sequence.

      L264-265: the main reason is that PCR bias (preferential PCR amplification of certain templates) skews the representation of taxa if the DNA is mixed prior to amplification.

      L331-346: this section is unclear; it should be specified which primers (primer names and sequences) with which barcodes were used for each condition; if different primer pairs were used for different sequencing platforms, it is unclear what the use of this comparison is. This should either be clarified and explained, or the whole section may be removed.

      L381: delete "so"

      L387-392: I suggest that this part is either removed or that it is clearly described why the authors are sure that PCR replicates are not necessary (which is against all present recommendations). While the increasing fidelity of polymerases may be a fact, the main problem with parallel PCRs is not errors (due to low fidelity) but random effects whereby primers align to templates with random frequencies. This statistical effect is impossible to handle by increasing polymerase fidelity, while it is easily handled by PCR replication.

      L424-426: This statement is rather obvious; I suggest deleting it.

    1. Now published in Gigabyte doi: 10.46471/gigabyte.14. Tianlin Pei (1,2), Mengxiao Yan (1), Jie Liu (1), Mengying Cui (1), Yumin Fang (1), Binjie Ge (1), and Jun Yang (1,2). Affiliations: (1) Shanghai Key Laboratory of Plant Functional Genomics and Resources, Shanghai Chenshan Botanical Garden, Shanghai Chenshan Plant Science Research Center, Chinese Academy of Sciences, Shanghai, 201602, China; (2) State Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Shanghai Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, 200032, China. For correspondence: yangjun@csnbgsh.cn, zhaoqing@cemps.ac.cn

      Reviewer 1. C Robin Buell. Is the language of sufficient quality? No. The manuscript could be improved with a round of editing for grammar.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. The sequencing, assembly and annotation methods need more details.

      Any Additional Overall Comments to the Author: This manuscript describes the sequencing, assembly, annotation, and analysis of the Tripterygium wilfordii genome. T. wilfordii is a medicinal plant that has long been used in traditional medicine due to its production of alkaloids and triterpenoids; the focus of this study was to identify cytochrome P450s involved in the biosynthesis of the triterpenoid celastrol.

      Based on the genome assembly metrics, the authors generated a robust representation of the genome sequence. Improvements in the analyses of the genome and in the manuscript would greatly strengthen confidence in the assembly. The authors should add the following metrics and additional information to the manuscript:

      More details are needed on the error correction of the assembly. Based on the methods, both nanopore and Illumina WGS reads were used; however, this is not explicit, nor are any metrics of the error correction provided.

      Specifically, it is not discussed how the nanopore reads were assembled. A company is cited for the genome assembly. Information on what assembly software was used must be provided.

      Every software program used, its version, and the parameters used should be provided in the methods. This is often missing.

      The quality of the genome should be confirmed using alignment of both the whole-genome shotgun reads and the mRNA-seq data. Specific metrics that should be provided include: the total number and percentage of reads that mapped, and read pairs that mapped in the correct orientation.

      No details on read quality assessment or trimming are provided.

      The CEGMA results should be omitted, this program has been deprecated.

      Line 337: The DNA was "sheared", not "interrupted", into fragments. Line 343: More details are needed on the library preparation and sequencing for the nanopore reads.

      Do the authors know the genome size of the species based on flow cytometry? Do they know the number of chromosomes that this species has? This should be stated and discussed in the context of the assembly size and number of pseudochromosomes.

      The genome-wide identification of the CYP450 candidates was difficult to follow. This section should be revised so that it is clear how the authors identified their candidate genes. Adding a supplemental figure would potentially be helpful. I found the coexpression section extremely difficult to follow. I would not call coexpression patterns coexpression profiles. Specifically, I did not understand the sentence on line 202 ("However, no…."). Essentially this is just sub-functionalization at the expression level, not evidence that there are two independent pathways.

      The evolution section should be expanded. How divergent is T. wilfordii from P. trichocarpa and R. communis?

      Table 1: "Index" should be replaced with "Metric".

      Figure S1: What k-mer size was used in the analysis? Figure S5: It is unclear what is on the x or y axis. Expand the figure legend.

      The manuscript should be proofed for grammar as there are numerous sentences that need editing.

      Recommendation: Major Revision

    2. Tripterygium wilfordii

      Reviewer 2. Xupo Ding. Is the language of sufficient quality? Only about one third of the manuscript is of sufficient language quality.

      Comments: This manuscript provides a reference genome assembly of T. wilfordii generated using a combined sequencing strategy (Nanopore, Bionano, Illumina HiSeq, and PacBio), and the functions of two CYP450 genes were identified with enzyme assays in vivo and in vitro. This research also provides valuable information to aid the conservation of resources and helps reveal the evolution of Celastrales and key genes involved in celastrol biosynthesis. However, the text needs substantial improvement.

      1. I suggest removing the comma in the title.

      2. "Nothing in biology makes sense except in the light of evolution" (T. Dobzhansky). The abstract does not present vital results from the manuscript, such as gene numbers, repeat percentage, and the comparative evolutionary analysis. The contribution of the T. wilfordii genome is not limited to celastrol biosynthesis (lines 38-39); it also provides valuable information to aid the conservation of resources and helps reveal the evolution of Celastrales and key genes involved in celastrol biosynthesis.

      3. "Nanopore" is not an appropriate keyword; the other platforms (Illumina, Bionano, PacBio, and Hi-C) are also presented in the manuscript.

      4. The legends mentioned in lines 59-61 might be controversial in a scientific paper.

      5. Lines 61-63 are written colloquially. Please consider replacing them with: "Extracts of T. wilfordii bark have been used as a pesticide in China since ancient times, as first recorded in the Illustrated Catalogues of Plants published in 1848."

      6. Lines 103-104 are not coherent with the preceding sentence.

      7. Line 112: is the N content really 0%?

      8. Lines 117-118: "Both results indicated that the presented genome is relatively complete." This is uncommon and definitely worth discussing. This sentence might be better placed in the Discussion section, even if it is credible.

      9. Line 145: the full name should be given at first mention.

      10. Line 150-155, Copia and Gypsy were missed.

      11. Were the gene families containing TwCYP712K1 and TwCYP712K2 expanded or contracted in the CAFE analysis?

      12. WGCNA might present much more reliable evidence for the candidacy of TwCYP712K1 and TwCYP712K2, even though Pearson's correlation coefficient is a simplified version of what WGCNA uses.

      13. The full peaks should be presented in Figures 5A and 5B. Uploading the NMR and MS data as additional files would enhance the credibility of the enzyme functions.

      14. Lines 269-272: the evolutionary analysis in Figure 2B indicates that T. wilfordii diverged earlier than P. trichocarpa and R. communis; does this suggest that the functions of TwCYP712K1 and TwCYP712K2 were fused during the evolution of Malpighiales and Celastrales in Figure 6? If the authors insist that these two P450s came from a common ancestor, a syntenic analysis of TwCYP712K1 and TwCYP712K2 between T. wilfordii and A. trichopoda, O. sativa, or V. vinifera might be more credible.

      15. Latin names should be given as complete species names in all figures; e.g., "T.wil" should be replaced with "T. wilfordii".

      16. Line 322: "transcriptom" should be "transcriptome".

      17. Line 330: please add the longitude and latitude.

      18. Please revise the English throughout, except for lines 327-509 and 526-599; lines 327-509 might come from the concluding report of the sequencing project.

      19. Line 606: should "LAST" be "BLAST"?

      20. I noticed that the T. wilfordii genome was published in Nature Communications in Feb. 2020. I therefore suggest adding some comparison to their assembly and triptolide synthesis, and citing this paper. Mentioning these points will look fair and will also highlight the distinctive celastrol synthesis work presented here.

      Recommendation: Major Revision

    1. Now published in Gigabyte doi: 10.46471/gigabyte.13

      This work has been peer reviewed in GigaByte, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Yoonjoo Choi. Is the language of sufficient quality? Yes. There are some minor typos. Perhaps this would not be a matter in other systems or viewers - all "fi" ligatures do not appear on my computer (Mac OS Preview), e.g. "affinity" -> "a inity", "artificial" -> "arti cial".

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes. The purpose of this software is clearly stated and it will be very useful for researchers in relevant research fields.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Yes. The author recommends running this package on Linux machines, though it is written in Python. It would be great for a non-Linux user to be able to run TEPITOPE and BasicMHC1 (for a quick epitope screen). I pip-installed it on both Ubuntu and Mac OS (just to see whether I could run TEPITOPE and BasicMHC1). The installation on Ubuntu was very easy and it runs fine. The Mac OS installation failed, but perhaps that is not the fault of epitopepredict (brew-installed Python 3.9.0).

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages? Yes. (Definitely not mandatory at all, but) it would be great if this package also provided a wrapper for the IEDB tools.

      Recommendation: Minor Revisions.

      Reviewer 2. Jayaraman Valadi. Is the language of sufficient quality? Yes, but there are a lot of spelling mistakes that must be corrected before acceptance.

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes. This is clearly explained in the manuscript.

      Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code? Yes. The source code is available on Github and it works as expected.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? No. The software depends on a number of external software packages. Installation of these needs to be explained clearly in the manuscript.

      Is the documentation provided clear and user friendly? Yes. Overall the documentation is good. Docstrings need minor improvements to make them more comprehensive.

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level? Yes. This is well explained in the manuscript.

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages? Yes. Adding a note on comparing the performance of different methods would be useful.

Additional Comments: The software is a Python wrapper for a number of available epitope prediction methods. Its unified architecture gives users easy access to all the methods and allows the results of each to be compared. Some of these methods/models have to be installed manually before they can be accessed through the wrapper. A new model trained by the authors has also been added; users can utilize this prediction model without having to install any additional dependencies. Salient features:

- The software supports visual comparison of predictions from different predictive models.
- Users can select a target protein for epitope scanning.
- Putative MHC-I and MHC-II epitopes can be predicted with various predictive models through the Python wrapper.
- Selection of the best predictions is possible.
- The positions of putative epitopes are highlighted on the target protein sequence.

      Overall the manuscript and software are quite comprehensive and can be accepted after minor revisions.

      Recommendation: Minor Revisions

    1. doi: 10.1093/gigascience/giaa146

Reviewer 2. Mile Šikić Reviewer Comments to Author: In their paper Murigneux et al. made a comparison of three long-read sequencing technologies applied to the de novo assembly of a plant genome, Macadamia jansenii. They generated sequencing data using Pacific Biosciences (Sequel I), Oxford Nanopore Technologies (PromethION), and BGI (single-tube Long Fragment Read) technologies. The sequenced data are assembled using a number of state-of-the-art long-read assemblers and the hybrid MaSuRCA assembler. Although the paper is easy to follow, and this kind of analysis is more than welcome, I have several major and minor concerns. Major concerns: 1) The authors use 780 Mbp as the estimated size of the genome. Yet, this is not supported by data. In the chapter "Genome size estimation", they present genome size estimates obtained by k-mer counting, but these sizes are 650 Mbp or less. 2) Since the real size of the genome is unknown, it would be worthwhile if the authors provided analyses such as those enabled by KAT (Mapleson et al., 2017), which compares the k-mer spectrum of the assembly to the k-mer spectrum of reads (preferably Illumina). To control for misassembled contigs, the authors might also align larger contigs obtained using different tools to compare similarity among them (e.g., using tools such as Gepard or similar). 3) The authors compare assemblies with an "Illumina assembly", but it is not clear what that means and why they consider this a valid comparison. 4) Although they started the ONT data analysis with four tools, they perform further analysis on just two (Flye and Canu). In addition, for PacBio data, they use three tools (Redbean, Flye and Canu). It is not clear why the authors chose these tools. Canu and Flye have larger N50, larger total length, and the longest contigs. However, this does not take into account possible misassemblies. Assemblers might have problems with uncollapsed haplotypes, which can result in assemblies larger than expected.
In their recent manuscript, Guiglielmoni et al. (https://doi.org/10.1101/2020.03.16.993428) showed that Canu is prone to uncollapsed haplotypes. That manuscript also shows that, on PacBio data, Canu produces much longer assemblies than other tools (1.2 Gbp). Therefore, a larger total assembly size cannot guarantee a better genome. Furthermore, on ONT data Raven has the second-best initial BUSCO score (before polishing), and its assembled genome consists of the smallest number of contigs. Therefore, I deem that the full analysis needs to be performed using all tools for both Nanopore and PacBio data. 5) It would be of interest to a broad community if the authors added the computational costs to the total cost per genome for each sequencing technology. They might compare their machines with AWS or other specified cloud configurations. Besides, it is not clear which types of machines they used. Information from the supplementary materials such as GPU, large memory, or HPC is not descriptive enough. Minor comments: 1) The authors use the published reference genome of Macadamia integrifolia v2 for comparison. It would be interesting if they could provide information about the sequencing read technology used for this assembly. 2) The authors mention that the newer generation of PacBio sequencing technology (Sequel II) provides higher accuracy and lower costs. It would also be worth mentioning the newer generations of assembly tools such as Canu 2.0, Raven v1.1.5 or Flye v2.7.1. It is worth considering Racon for polishing with Illumina reads too. Yet, this is not a requirement, because the authors already use state-of-the-art tools.
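The k-mer-based genome size estimate the reviewer refers to (the principle behind tools such as KAT and GenomeScope) can be sketched as follows; the histogram values below are made up for illustration and are not from the paper:

```python
# Toy sketch of k-mer-based genome size estimation.
# Idea: genome size ~ (total k-mer observations) / (depth of the coverage peak).

def estimate_genome_size(histogram, min_depth=2):
    """histogram: dict mapping k-mer depth -> number of distinct k-mers.
    Depths below min_depth are treated as sequencing errors and ignored."""
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    total_kmers = sum(d * n for d, n in usable.items())  # all k-mer observations
    peak_depth = max(usable, key=usable.get)             # homozygous coverage peak
    return total_kmers / peak_depth

# Hypothetical histogram: error k-mers at depth 1, coverage peak at depth 30.
hist = {1: 5_000_000, 29: 800_000, 30: 1_000_000, 31: 700_000}
print(round(estimate_genome_size(hist)))  # ~2.5 Mbp for this toy histogram
```

In practice the histogram would come from a k-mer counter run on the Illumina reads, and heterozygosity produces a second peak at half the homozygous depth, which is why dedicated tools fit a model rather than taking a single maximum.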

    2. Now published in GigaScience

Reviewer 1. Cecile Monat. Reviewer Comments to Author: Introduction part:

• It would be nice to state the genome size and to indicate the reference genome that has already been sequenced and assembled for Macadamia, just to provide context for people who are not familiar with Macadamia. Methods part:
      • ONT library preparation and sequencing part:
• What was the reason for using both MinION and PromethION, and not only PromethION?
        • For what reason didn't you use the same version of MinKNOW to assemble the MinION (MinKNOW (v1.15.4)) and PromethION (MinKNOW (v3.1.23)) data?
      • Assembly of genomes part:
        • Is there a reason for doing 4 iterations of Racon? And not 3 or 5?
• Maybe you should specify that Racon is used as an error-correction module and Medaka to create the consensus sequence.
• "Hybrid assembly was generated with MaSuRCA v3.3.3 (MaSuRCA, RRID:SCR_010691) [32] using the Illumina and the ONT or PacBio reads and using Flye v2.5 to perform the final assembly of corrected mega-reads": this sentence is not very clear to me. Does it mean that you first used the ONT/PacBio data + Illumina with the MaSuRCA software to generate what they call "super-reads", and then from these data you used Flye to get the final assemblies?
• As I understand it, stLFR is similar to 10x Genomics; why not compare that technology's data too?
      • Assembly comparison part:
• "We compared the assemblies with the published reference genome of Macadamia integrifolia v2 (Genbank accession: GCA_900631585.1)." First, I think it is important to add the reference paper. Secondly, I cannot see where you compared your assemblies with the published one; as far as I can tell, you compared all your assemblies with each other, but I cannot find any other assembly.
        • when you said "Illumina assembly" do you refer to the Macadamia integrifolia assembly? If so, please clarify it in the rest of the paper, and add the data for this reference genome in your figures. Results part:
      • ONT genome assembly part:
• Is there any interest in combining the MinION and PromethION data? Are there any advantages to combining them?
• "The genome completeness was slightly better after two iterations of NextPolish (95.5%) than after two iterations of Pilon (95.2%) (Sup Table 1)." Here I would point out that this is the case for the Flye assembly, but surprisingly (at least for me?) after two iterations of NextPolish on the Canu assembly, the results were slightly worse than with one iteration. So, depending on the assembler you use, the number of iterations needed might be different.
• "As an estimation of the base accuracy, we computed the number of mismatches and indels as compared to the Illumina assembly." Here I am not sure which assembly you refer to when you use the term "Illumina assembly". Do you refer to the Macadamia integrifolia assembly or to the MaSuRCA hybrid assembly? If you refer to the latter, I would suggest using the term hybrid assembly instead of Illumina assembly, as it might be confusing.
• Why not use the Pilon and NextPolish steps on the ONT+Illumina (MaSuRCA) assembly, since they are tools dedicated to polishing with long and short reads?
      • PacBio genome assembly part:
• Why did you use FALCON as the assembler for PacBio but not for ONT? If I am correct, it is not built uniquely to work on PacBio data but is suitable for all long-read technologies.
• "Two subsets of reads corresponding to 4 SMRT cells and equivalent to a 43× and 39× coverage were assembled using Flye." Why choose Flye for this analysis? I am also wondering whether this part is necessary, since afterwards you do the ONT equivalent-coverage analysis, which is more interesting for the comparison of the technologies.
• Comment on the structure: for this paragraph, I would prefer to have first the results with the same assemblers as with the ONT data, then an explanation of why you chose to also perform a test with FALCON, and then the FALCON results.
      • stLFR genome assembly part:
        • Supernova might have been used on PacBio data as well, why not?
• Why not try to complement the PacBio data with stLFR as you did with ONT? Are there any incompatibilities? Discussion part:
      • "The amount of sequencing data produced by each platform corresponds to approximately 84× (PacBio Sequel), 32× (ONT) and 96× (BGI stLFR) coverage of the macadamia genome" I would have put this information into the Results part, but it's only my preference.
      • "For both ONT and PacBio data, the highest assembly contiguity was obtained with a long-read only assembler as compared to an hybrid assembler incorporating both the short and long reads." I would suggest using the term "long-read polished" instead of "long-read only" since the assembly with the best contiguity integrates the Illumina data for the polishing. Tables and figures:
      • Table 2:
• For this table, if I understood it properly, you have chosen the best assembly of each technology. If I am right, please state this in the table caption.
      • Figure 1:
• If I understood properly, when you write "Base accuracy of assemblies as compared to Illumina assembly" here you refer to the Macadamia integrifolia assembly; if so, I would add the Macadamia integrifolia assembly to this figure, and maybe put a dotted line at its level for each category (InDels and mismatches) so it is easier for the reader to compare against it.
      • Figure 2:
        • Here I would put all the assemblies you had in Figure 1
    3. Comparison of long read methods for sequencing and assembly of a plant genome

This preprint was published in GigaScience in December 2020 and has an Update article in GigaByte - https://doi.org/10.46471/gigabyte.24

    1. An efficient and robust laboratory workflow and tetrapod database for larger scale eDNA studies

      This work has been peer reviewed in GigaScience, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Taylor Wilcox http://dx.doi.org/10.5524/REVIEW.101629

      Reviewer 2: Han Ming Gan http://dx.doi.org/10.5524/REVIEW.101630

1. Now published in Gigabyte doi: 10.46471/gigabyte.11. Bruno C. Genevcius and Tatiana T. Torres, Department of Genetics and Evolutionary Biology, University of São Paulo, São Paulo, SP, Brazil. For correspondence: bgenevcius@gmail.com

      This work has been peer reviewed in GigaByte, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Peter Thorpe. Are all data available and do they match the descriptions in the paper?

      No. They are submitted but still private. These need to be released.

      Final Comments: SRA datasets need to be released.

      Recommendation: Minor Revision

      Reviewer 2. Guillem Ylla.

Is the language of sufficient quality? While the text is mostly clear, I detected a few spelling mistakes (listed below) and there might be more that escaped my attention. I would recommend that the authors exhaustively check the MS. Line 53: “Stink bug” missing “bug”. Lines 39, 58, 69, and figures: mixed usage of “Chinavia impicticornis” and “C. impicticornis”. After the first appearance of the full name, the authors should be consistent in whether they keep using the full name or the abbreviation, not mix both.

Are all data available and do they match the descriptions in the paper? No. The authors report multiple accession numbers from NCBI, including a BioProject ID, but they are not open and I was unable to check whether the data match the paper descriptions. The TSA accession seems not to have been created yet, and the MS displays a placeholder (GIVF00000000) in its place.

Are the data and metadata consistent with relevant minimum information or reporting standards? No. Missing items from the checklist: 1) "Any perl/python scripts created for analysis process". In line 94, “using a custom Perl script [16]”, the authors provide a citation but not the code. 2) "Full (not summary) BUSCO results output files (text)".

Is the data acquisition clear, complete and methodologically sound? Yes. The end of the fifth nymphal instar dataset was obtained at “seven days after molting from fourth to fifth instar”. Could the authors specify how many days the 5th nymphal instar lasts, to give a better idea of how late in the 5th nymphal stage this time point is?

Could the authors briefly describe the rationale behind choosing the 5th nymphal instar instead of other nymphal stages? They explain why nymphal stages were used instead of adults, but not why the 5th nymphal instar.

Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. I would appreciate it if the authors could share the code/commands for removing redundant reads and performing the assembly as supplementary materials or on GitHub (recommended).

In the abstract, the authors describe 38,478 transcripts, of which 12,665 had GO terms assigned. It is not clear where this number comes from. In line 120 it is mentioned that “39,478 had successful matches in the NCBI”. Is there a typo in one of these two numbers (38,478 vs 39,478)? Moreover, the MS says “we only kept contigs that matched to Arthropod species”, and this number is reported to be 33,871. I urge the authors to better explain the steps they followed and clarify where all these numbers come from.

Is there sufficient data validation and statistical analyses of data quality? Yes. Using the whole insect body often includes contaminant RNAs from the gut microbiome, endosymbionts, viruses, and other microbiological specimens from the cuticle and environment. Since the authors do not filter out reads from possible contaminants before the assembly, I would appreciate it if they could perform a BUSCO analysis using the prokaryote database before and after the selection based on similarity to databases. This would allow estimating the number of contaminants in the original assembly and whether they were successfully discarded after the selection.

Lines 126-127 are not clear. There are 12,665 contigs that have 5,087 GO terms. I deduce that 12,665 contigs have at least 1 GO term, and that these contain 5,087 distinct GO terms. Could the authors make this clearer in the text?
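The two counts the reviewer asks the authors to distinguish can be illustrated with a toy annotation table (the contig IDs and GO terms below are hypothetical, not from the paper):

```python
# Toy illustration: contigs with at least one GO term vs. distinct GO terms.
annotations = {
    "contig_1": ["GO:0008150", "GO:0003674"],
    "contig_2": ["GO:0008150"],
    "contig_3": [],                      # no GO term assigned
    "contig_4": ["GO:0005575"],
}

# Count contigs that received >= 1 GO term (the "12,665" kind of number).
contigs_with_go = sum(1 for terms in annotations.values() if terms)

# Count distinct GO terms across all contigs (the "5,087" kind of number).
distinct_go_terms = len({t for terms in annotations.values() for t in terms})

print(contigs_with_go)    # 3 contigs have at least one GO term
print(distinct_go_terms)  # 3 distinct GO terms overall
```

Stating explicitly which of these two quantities each reported number refers to would resolve the ambiguity.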

Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes. I don’t think that a dataset consisting of two time points (early and late) of the same stage (nymph 5) can be considered a “developmental transcriptome”. I would urge the authors to change the terminology and title.

In the abstract, the authors claim that this is the “first genome-scale study with”. Since the study is only transcriptomic, I find it misleading to define it as a “genome-scale study”.

In Table 1 and line 117 the authors claim that they generated the highest amount of RNA-seq reads for pentatomids to date. However, for Halyomorpha halys there are multiple available RNA-seq datasets not mentioned which, taken together, I suspect would exceed the data generated for C. impicticornis. I would suggest toning down this statement on line 117.

Additionally, there are at least 3 available genomes for pentatomid species. I think this information should at least be mentioned in the introduction.

In line 61, could the authors define “almost nonexistent”? How many are there?

      Recommendation: Minor Revision

    1. Improvements in the Sequencing and Assembly of Plant Genomes

      This manuscript is an Update to a paper published in GigaScience in December 2020. See https://doi.org/10.1093/gigascience/giaa146

    2. Background

      This work has been peer reviewed in GigaByte, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Chao Bian

      1. Is the language of sufficient quality? No

      2. Are all data available and do they match the descriptions in the paper? Yes

      3. Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide Yes

      4. Is the data acquisition clear, complete and methodologically sound? Yes

      5. Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes

      6. Is there sufficient data validation and statistical analyses of data quality? Yes

      7. Is the validation suitable for this type of data? Yes

      8. Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

    3. Abstract

      This work has been peer reviewed in GigaByte, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Mile Šikić

      1. Is the language of sufficient quality? Yes

      2. Are all data available and do they match the descriptions in the paper? Yes

      3. Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples (http://gigadb.org/site/guide) Yes

      4. Is the data acquisition clear, complete and methodologically sound? Yes

      5. Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes

      6. Is there sufficient data validation and statistical analyses of data quality? Yes

      7. Is the validation suitable for this type of data? Yes

      8. Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Additional Comments: In their update to the previous study on the comparison of long read technologies for sequencing and assembly of plant genomes, Sharma et al. presented a follow-up analysis using a newer generation of base callers for nanopore reads and PacBio HiFi reads. I argue that this study is an important update, but it is not suitable for publication in the current form.

      My major comments are the following:

      1. It is not clear which version of the base caller the authors used in assemblies related to Table 1 and Table 3.
2. For phased assemblies, it is important to provide information about the size of the alternative contigs.
      3. In Table 1, it would be great to have results for methods that do not phase assembly (i.e. Flye).
4. There is no explanation of why the authors use IPA instead of other HiFi assemblers, i.e. hifiasm, which, from my experience, performs better than IPA.
      5. A sentence related to Table 3, “The quality of the assemblies was more contiguous with less data in each of these cases when HiFi reads were used instead of the earlier continuous long reads (Table 3).” is not clear. Following Table 3, assemblies achieved using long reads have similar or longer N50 and higher BUSCO score. Also, it is not clear which assembler was used for long reads.
    1. Abstract

      Reviewer 1. Wei Zhao Are all data available and do they match the descriptions in the paper? No

      The BioProject PRJNA667278 is currently not accessible.

      Is there sufficient data validation and statistical analyses of data quality? No

      The size of the final genome assembly is significantly larger than the estimated size, which is indicative of redundancy. I would suggest removing the potential haplotype redundancy further. I would also suggest a k-mer analysis to validate the genome size. For a chromosomal assembly, the ratio of properly paired reads is lower than expected.

      Additional comments annotated on the paper have been provided to the author.

      Major Revision

    2. Now published in Gigabyte doi: 10.46471/gigabyte.10

Reviewer 2. Ramil Mauleon. Are all data available and do they match the descriptions in the paper? No. Additional Comments: BioProject PRJNA667278 in NCBI appears to be still embargoed; a reviewer link would be helpful.

Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide No. Additional Comments: Sample provenance/passport information is lacking for the Cannbio-2 material. Outright mention of the source of the RNAseq + TSA info in the methods would be helpful. Same comment as above for the GenBank BioProject.

Is the data acquisition clear, complete and methodologically sound? No. Additional Comments: It is mostly clear from the DNA extraction, PacBio sequencing and primary assembly. The anchoring of the assembled contigs into pseudochromosomes using another published genome lacks detail and only broadly mentions the software used (RaGOO). This is a very critical step that will determine whether the Cannbio-2 assembly is an improvement vs the mentioned genome assemblies (esp. cs10, PK); it is a circular argument if the genome assembly is ascertained against existing assemblies from other cannabis accessions and declared improved. As noted by the authors, there are differences (rather than inconsistencies) between the compared published genomes, and these may be inherent in each genome; any analyses on an assembly based on these would cause ascertainment bias. Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. Additional Comments: The previous comment regarding anchoring of contigs to an existing genome applies to this as well. Regarding genome annotation, is there any basis for the choice of annotation method, i.e. the annotator software (Augustus), the consensus builder (EVN), and PASA? MAKER (MAKER-P) and BRAKER are available pipelines, both reported as good for plants, and GeneMark is a prediction software suite that excels in plant genome annotation. Regarding evidence for annotation, it appears that transcript de novo assemblies were used, but the RNAseq data were not incorporated in the prediction step. No orthologous protein databases appear to have been used as hints for gene prediction. These are just observations/suggestions to further improve the annotation quickly. In general, the annotation steps would benefit from a bit more detail for reproducibility, but I would say the annotation, if done at the contig level, would be very solid.

Is there sufficient data validation and statistical analyses of data quality? No. Additional Comments: On the assembly itself, since there was no mention of the method for anchoring contigs into chromosomes, there is no information on how scaffolds are spaced along the genome: is it padding by a fixed number of Ns? Are all assembled contigs anchored, or are there unanchored ones? Again on the point of anchoring and ordering of contigs, ideally evidence from the same sequenced material would be the best to use (an example: a genetic linkage map with sequence-based markers). Plant genomes are notorious for rearrangements (inversions, insertions, translocations, tandem repeats, etc.) even within species, and this appears to be the weakest evidence in this paper (how the contigs were anchored into chromosomes). Regarding gene annotation, you can conduct the BUSCO analysis on the predicted genes and report those results as well. Again, results will reflect the outcome of the annotation method used. For BUSCO in general, I would be cautious in comparing results across published genomes; it is more informative during optimization of the assembly methodology or when testing different assembly methods (checking whether you are improving the assembly of the same underlying dataset). On this same topic, are the unmapped contigs from the other assemblies used? The same question applies to the assembly done by the authors.

Is the validation suitable for this type of data? No. Additional Comments: Mostly yes for the primary genome assembly. The data validation for the pseudochromosome assembly analysis is not convincing. If done at the contig level, the genome annotation would be solid.

Is there sufficient information for others to reuse this dataset or integrate it with other data? No. Additional Comments: Recapping, what is missing is the biomaterial information, information on the pseudochromosome assembly, and explicit mention of the GenBank IDs for the transcript assembly and RNAseq data used in annotation (instead of only in the references); providing these would improve re-use and integration. On the chromosome nomenclature, I do not understand why the author does not mention the ongoing nomenclature being used by the community, as reported in the NCBI cs10 RefSeq release.

Any Additional Overall Comments to the Author: I believe reporting results based on the main evidence generated by the authors (in this current work and the previous one on the transcriptome) would make this a stronger data release, i.e. the contig/scaffold assemblies and their annotation based on your own RNAseq data. On a related note, have you tried using your short-read data during assembly? Could your assembly have been improved if you used the Illumina data during the assembly itself (hybrid assembly, scaffolding)? Cannabis genomes are known to be highly heterozygous; a report of this would be easy to produce from your assembly vs your read datasets, especially the short reads, and would be an important finding.

      Recommendation Major Revision

    1. Bone mass loss

      Reviewer 1. Levi Waldron Wang et al. present a shotgun metagenomics cross-sectional study of fecal specimens from 361 elderly women with the primary objective of identifying correlations between bone mass density and microbial taxa. The methods are reasonable and I have no major concerns about this manuscript, only some moderate suggestions to improve reporting and discussion.

      For items answered “Yes” it would help to provide line numbers in the manuscript, as done for some but not all checklist items.

      3.0 Participants:

It’s stated that “Fecal samples of 361 post-menopause women were randomly collected at the People’s Hospital of Shenzhen” – I suspect the correct word here is “arbitrarily” rather than “randomly”, unless a random number generator was used to select a random sample of all eligible patients. Some statement of how the women were recruited and how representative they are of all patients at the hospital is warranted. E.g., were they recruited from the emergency room, a cancer ward, all outpatients, all admitted patients, etc.? See also the later comment about generalizability.

      4.9 Batch Effects:

This is left “NA” – can the authors at least comment (in the manuscript) on the potential for batch effects affecting cases and controls differently – i.e., were they all prepared together or in separate libraries, and were they sequenced in the same runs or completely separately?

      8.0 Reproducible research:

      I appreciate that data have been posted at EBI and CNGB. Could the authors also comment on whether the metadata essential to the analysis are also provided, and that these can be linked to the sequence data? Although I’m glad to hear that “Others could reproduce the reported analysis from clean reads by the declared software and parameters” I do think that the code to reproduce the analysis should also be reported.

      8.1 Raw data access

      The checklist states “no raw reads for ethical” but the manuscript states “The sequencing reads from each sequencing library have been deposited at EBI with the accession number: PRJNA530339 and the China National Genebank (CNGB), accession number CNP0000398.” so there is a disconnect. Assuming human sequence reads are removed from the data, I’m not convinced of ethical reasons not to post microbial sequence reads, but it seems the authors have posted the microbial sequence reads.

      10.1 – 10.5 Taxonomy, differential abundance, other analysis, other data types, and other statistical analysis are all blank. Some should be “N/A” but others just seem to be overlooked.

      13.2 Generalizability: I think this is an important element to include in the discussion. How typical are your volunteers of all women that age?

      Minor:

      “Making these data potentially useful in studying the role the gut microbiota might play in bone mass loss and offering exploration into the bone mass loss process.” -> These data are potentially useful in studying the role the gut microbiota might play in bone mass loss and in exploring the bone mass loss process.

      The manuscript is well written, but there are a few other places that would benefit from some copy editing.

    2. Abstract

      Reviewer 2. Christopher Hunter Is the language of sufficient quality?

      Yes.

      Is the data all available and does it match the descriptions in the paper?

      No.

Most of the data are provided as supplemental files on bioRxiv, but in Excel rather than CSV format. These data files will need to be curated into a GigaDB dataset.

      Is the data and metadata consistent with relevant minimum information or reporting standards?

      Yes.

      Is the data acquisition clear, complete and methodologically sound?

      No.

Comment. The consent by the patients to openly share all metadata is not clearly stated; simply saying the study was approved by the bioethics review board does not mean consent was given to share the data, just that the institute consented to the study being done.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No.

Comments: Maybe to someone with a good understanding of statistics there is sufficient detail; this is an area that a statistician should look at. For me, the descriptions of the analysis and the methods do not give anywhere near enough detail for me to either understand what was done or replicate it. The concept of "Gut metabolic modules" is not defined here, with just a reference to another paper; a brief explanation of what is meant by the term would be useful.

      Is there sufficient data validation and statistical analyses of data quality?

      Yes.

      Comments. The sequences were filtered for human contaminants and adapter sequences, and low-quality reads were removed.

      Is the validation suitable for this type of data?

      No.

      Comments: The metadata is extensive, but some basic points are missing: collection date, antibiotic use, relatedness of samples/patients. Other less important details are also missing, like why and how this cohort was selected.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      Yes

      Any Additional Overall Comments to the Author

      Yes

      • I am concerned about the open sharing of patient metadata without evidence that its sharing was consented to.

      • A lot of metadata is collected and provided in the supplemental tables (which is great for reuse), but there are no explanations of what the values are. While some headers are self-explanatory, others are less so, e.g. what is CROSSL(pg/ml)? or "Side crops"?

      • How were the various conditions diagnosed?

      • I see no indication of antibiotic usage in the cohort.

      • Are all the samples from different individuals? Was each sample a single bowel movement?

      • There is no background given as to how this cohort was selected or why.

      • There is no discussion of the bone mass density of a "normal" cohort. Does this cohort represent a normal cohort, or is it already biased toward low or high density? Simply describing the cohort with respect to normal (T of -1 or above), low (-1 to -2.5) or osteoporosis (< -2.5) would be a help. I cannot see the T-scores included in the sTab1a file; are they computed from the L1-L4(z) values given?

      • There are a number of NA values in the table of sample metadata, but there is no explanation as to how these samples were handled in the analysis.

      • In general, I feel that a lot of poorly described statistical analyses are included that are not required as part of a data note; the focus should be on describing the data and ensuring the data and metadata are well explained.
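      The T-score grouping the reviewer suggests can be written as a small helper. This is a minimal sketch: the function name and the inclusive/exclusive boundary handling are assumptions, following the cut-offs quoted above (normal: T of -1 or above; low: -1 to -2.5; osteoporosis: below -2.5).

```python
# Sketch of the reviewer's suggested T-score bands; boundary handling at -2.5
# is an assumption, since the review gives the ranges informally.

def classify_t_score(t: float) -> str:
    if t >= -1.0:
        return "normal"
    if t >= -2.5:
        return "low bone mass"
    return "osteoporosis"

print(classify_t_score(-1.8))  # -> low bone mass
```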
    3. certified by peer review

      This work has now been published in GigaByte here: https://doi.org/10.46471/gigabyte.12

    1. Now published in Gigabyte doi: 10.46471/gigabyte.9

      Reviewer 2. Levi Waldron Chen et al. use 16S amplicon metagenomic sequencing to investigate urinary bacterial communities and their correlation to lifestyle and clinical factors, and reproductive tract (cervix, uterine cavity, vagina) microbiota in a cross-sectional study of 147 Chinese women of reproductive age. This is an important but challenging study, because of the threat of microbial contamination in low microbial biomass specimens such as the upper reproductive tract and urine.

      Checklist item 4.0

      The laboratory/center where the laboratory work was done is not actually stated in lines 121-133.

      Negative controls and contamination

      Negative controls were generated for the 10 women undergoing surgery as sterile saline collected through the urine catheter. I assume this was done after the catheter was used for urine collection, but this should be stated.

      No negative controls were used for the self-collected urine specimens. However, it seems likely that mid-stream self-collection would be more prone to contamination than catheter sampling by a doctor during surgery. Some possibilities for negative controls in this setting exist, such as including a sample of sterile saline with the self-collection kit and asking participants to fill another vial with it immediately following urine collection. The lack of negative controls for self-collected specimens should be stated as a limitation.

      The authors identify the risk of contamination from vulvovaginal region (lines 192-193) but not of cross-contamination. Discussion of the risk of cross-contamination during collection and subsequent processing, steps to mitigate and identify it, and comparison of results to bacterial taxa identified as common contaminants (e.g. Eisenhofer et al, PMID 30497919), is warranted.

      Comparability of urine sampling methods

      Since no specimens were collected by both self-collection and catheter sampling during surgery, there is no way to directly assess the accuracy of self-collection using the catheter as a “gold standard”. This should be stated as a limitation.

      I could not find an analysis comparing the microbial composition of the catheter-collected and self-collected specimens. Some analysis comparing the two could help address the quality of self-collected specimens lacking negative controls.

      Discussion

      The authors do not include overall interpretation or limitations in the Discussion, saying under checklist items 12.0, 12.1, 13.0 “The discussion was suggested to focus on the potential uses according to the article format.” I think the editors should clarify to authors where these key discussion points belong. I think no article is complete without some discussion of limitations; see above for limitations noted of this study.

      Checklist item 13.2 Generalizability

      Authors state “The generalizability of the study is to women of reproductive age, and is shown in line 236-237” but on these lines I see description of statistical methods. This does deserve some discussion though, because the sample includes only women who underwent hysteroscopy and/or laparoscopy for conditions without infections, and has a number of exclusion criteria. This cannot be a representative cross-section of all women of reproductive age, so some discussion of how this sample may be different or similar to the population of all women of reproductive age is warranted. If the authors claim this sample should be generalizable to all women of reproductive age, that should be stated along with the intentional restrictions of the sampling and rationale of why these criteria are not expected to have any impact on the microbiota sampled.

      Clustering of patients

      Lines 212-213: cutting a hierarchical clustering into discrete groups can be done for any dataset, and without some analysis such as Prediction Strength (Tibshirani and Walther, J. Comput. Graph. Stat. 14, 511–528 (2005)) or another measure of cluster validation, this isn’t evidence of distinct patient groups and that should be stated clearly. It is OK to use the grouping to discuss general trends as long as care is made not to imply these are distinct patient subsets without further analysis. I am cautious about this because distinct subsets are intuitively appealing to many readers and the existence of distinct subsets can be harder to correct than to claim.

      Minor

      Line 241 “As the large-scale cohort” -> As a large-scale cohort

    2. Abstract

      Reviewer 1. Christopher Hunter Is the language of sufficient quality?

      Yes.

      Is the data all available and does it match the descriptions in the paper?

      No.

      Comment: Line 96-97: "In this study, a total of 147 reproductive age women (age 22-48) were recruited by Peking University Shenzhen Hospital (Supplementary Table 1)." But Sup. Table 1 has only 137 samples. Revise the text to explain that only 137 samples were used for the main analysis, with the 10 extra for validation.

      Line 103-104: "None of the subjects received any hormone treatments, antibiotics or vaginal medications within a month of sampling." Sup. Table 1 has a column for "Antibiotic use True/False", and 41 samples have "T"? This needs explaining. It's possible the spreadsheet's True refers to a longer time period, but that's not explained anywhere.

      Line 110-112: "The samples from an additional 10 women were collected for validation purposes by a doctor during the surgery in July 2017." Where are these metadata? They are not included in Sup. Table 1.

      The data presented and discussed in "additional-findings.docx" are not included in the data files (yet); these should either be removed (as not included in the main article), or the methods expanded upon (to include negative control details) and added to the main text.

      Is the data and metadata consistent with relevant minimum information or reporting standards?

      Yes.

      Comment. The supplemental tables need some better legends/descriptions to help readers understand what data is in them.

      Is the data acquisition clear, complete and methodologically sound?

      Yes.

      Comment. The wet-lab and bioinformatics methods could benefit from being included in protocols.io.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes

      Is there sufficient data validation and statistical analyses of data quality?

      Yes

      Is the validation suitable for this type of data?

      Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      Yes

      Any Additional Overall Comments to the Author

      Yes

      The Figures appear to be mixed up: what is displayed as Figure 1 in the manuscript appears to relate to the legend given for Figure 2, Figure 2 relates to the legend of Figure 3, and Figure 3 relates to the legend of Figure 1!!!

      Line 69 - Chen et al.: no citation number link provided.

      Line 74 - Thomas-White et al. (2018): no citation number link provided.

      Line 79 - Gottschick et al. (2017): no citation number link provided.

      Line 246-248: "The initial results here indicate a close link between the urinary microbiota with the general and diseased physiological conditions,..." As this study is looking at "healthy" individuals, I do not believe there is sufficient evidence to back up this statement about the "diseased" physiological conditions.

      Line 274-275: "The sequences of bacterial isolates have been deposited in the European Nucleotide Archive with the accession number PRJEB36743". This accession is not public, so I am unable to see what is included here.

      If available, we would like to see the Real-Time PCR data from the experiments made available in Real-Time PCR Data Markup Language (RDML).

      The additional cohort of 10 women is almost a different study: it didn't have the same 16S rRNA amplicon sequencing done, and was only a validation that some live bacteria can be cultured from urine in a small number of cases (3/10). If it is to be included, Table S5 should be updated to include the specific INSDC accessions for the submitted sequences. (The title of Table S5 in the file currently says Table 1.)

    1. Now published in Gigabyte doi: 10.46471/gigabyte.8

      Reviewer #1: Wei Zhao. Data Release Checklist.

      Is the language of sufficient quality? Yes

      Are all data available and do they match the descriptions in the paper? Yes

      Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide Yes

      Is the data acquisition clear, complete and methodologically sound? Yes

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes. Additional Comments: See attached PDF file.

      Is there sufficient data validation and statistical analyses of data quality? No. Additional Comments: Check and filter potential contamination of the raw assembly.

      Is the validation suitable for this type of data? Yes. Additional Comments: But maybe no, see attached PDF.

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes. Additional comments annotated on the paper and shared with the authors.

      Recommendation: Major Revision

      Reviewer #2: Daniel Lang. Data Release Checklist.

      Is the language of sufficient quality? Yes

      Are all data available and do they match the descriptions in the paper? Yes

      Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide Yes

      Is the data acquisition clear, complete and methodologically sound? Yes

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes

      Is there sufficient data validation and statistical analyses of data quality? Yes. Additional Comments: There is an exceptionally high number of scaffolds for 10x, a bad BUSCO score, and a discrepancy between the k-mer/FCM estimates and the assembly size that is unusual. That would have been worthy of discussion.

      Is the validation suitable for this type of data? Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Recommendation: Accept

    1. Now published in Gigabyte doi: 10.46471/gigabyte.7

      This work has been peer reviewed in GigaByte, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Qiye Li Since I am unable to access the data submitted to NCBI or GigaDB, I cannot judge this issue currently. Please make sure that the gene annotation, repeat annotation, transcriptome assembly, gene expression matrix, and genetic variant data have been uploaded somewhere in addition to the raw reads and genome assembly.

      While the bioinformatic tools used in all the steps are indicated clearly, the parameters for many tools are not defined.

      What is the gap ratio (i.e. % of unclosed gaps or Ns) of the genome assembly? As I know, the raw Supernova assembly may have a high proportion of gaps, although the scaffold N50 is pretty good. Additional gap closer steps (e.g. using GapCloser, RRID:SCR_015026) would improve the completeness of the assembly.

      BUSCO analysis is competent to assess the completeness of the protein-coding gene space of the genome assembly. But a good BUSCO score does not necessarily mean good assembly completeness. Another conventional way to demonstrate the completeness of the assembly is to show the metrics of DNA read mapping, such as the overall mapping rate, % in proper pair, % of covered bases, etc.
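      The mapping metrics suggested here could be computed, for instance, directly from SAM records. This is a minimal sketch, not the authors' pipeline; the toy records and field layout follow the standard SAM format.

```python
# Sketch of read-mapping completeness metrics from SAM lines. Flag bits used:
# 0x4 = unmapped, 0x2 = mapped in proper pair, 0x100/0x800 = secondary/supplementary.

def mapping_metrics(sam_lines):
    """Return overall mapping rate and proper-pair rate from SAM lines."""
    total = mapped = proper = 0
    for line in sam_lines:
        if line.startswith("@"):          # skip header records
            continue
        flag = int(line.split("\t")[1])
        if flag & (0x100 | 0x800):        # ignore secondary/supplementary alignments
            continue
        total += 1
        if not flag & 0x4:                # read is mapped
            mapped += 1
            if flag & 0x2:                # mapped in proper pair
                proper += 1
    return {"mapping_rate": mapped / total, "proper_pair_rate": proper / total}

# toy records: one properly paired read (flag 99) and one unmapped read (flag 4)
example = [
    "@HD\tVN:1.6",
    "r1\t99\tchr1\t100\t60\t4M\t=\t300\t300\tACGT\tFFFF",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF",
]
print(mapping_metrics(example))  # -> {'mapping_rate': 0.5, 'proper_pair_rate': 0.5}
```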

      How is the completeness of the gene set generated by the Fgenesh++ pipeline? I suggest that the authors provide BUSCO score for the Fgenesh++ gene set as they did for the transcriptome assembly.

      Methods related to Alzheimer’s Genes Analysis: The methods used to identify the Alzheimer’s disease (AD) related human genes in antechinus seem to be flawed, as the authors only performed unidirectional searches for homologs in the antechinus gene set. I think the authors should identify bona fide orthologs of these AD-related genes in antechinus. The conventional way to determine orthologs between two species is based on a reciprocal best hit (RBH) strategy (i.e. RBHs between the human and antechinus gene sets).
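      The reciprocal-best-hit determination described above might be sketched as follows. Hits are (query, subject, bitscore) tuples, as could be parsed from BLAST tabular output; the toy gene names are hypothetical.

```python
# Sketch of the reciprocal-best-hit (RBH) orthology strategy the reviewer
# recommends; input tuples and names are illustrative assumptions.

def best_hits(hits):
    """For each query, keep the subject with the highest bitscore."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: subj for q, (subj, _) in best.items()}

def reciprocal_best_hits(fwd_hits, rev_hits):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    fwd = best_hits(fwd_hits)
    rev = best_hits(rev_hits)
    return {a: b for a, b in fwd.items() if rev.get(b) == a}

# humanA <-> antB is reciprocal; humanC's best hit antB points back to humanA
fwd = [("humanA", "antB", 200.0), ("humanC", "antB", 90.0)]
rev = [("antB", "humanA", 198.0)]
print(reciprocal_best_hits(fwd, rev))  # -> {'humanA': 'antB'}
```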

      Reviewer 2: Walter Wolfsberger PRJNA664282 accession number is not found on NCBI. Is it scheduled to be released with the publication?

      Appropriate tools were used for appropriate analyses. The Y chromosome identification approach seems sound.

      The authors' bioinformatic approaches are sound, with the right tools and approaches used for the analysis.

      The preprint is well worded and easy to understand and follow. It provides a good amount of context, which justifies the extra analyses done in the publication. The assembly quality is adequate, with a relatively low N50 but good completeness scores, given that mammalian genomes have higher levels of low-complexity/repetitive content. The metrics presented adhere to the scope of GigaByte, and the data itself is valuable to the scientific community.

    1. Genome sequencing

      Reviewer 2: Mahul Chakraborty

      Reviewer Comments to Author: In "Two high-quality de novo genomes from single ethanol-preserved specimens of tiny metazoans (Collembola)", Schneider et al. described de novo genome assemblies of two tiny field-collected collembolan specimens. The authors collected high-quality genomic DNA from the specimens following a Pacific Biosciences recommended protocol for ultra-low-input libraries, amplified it, and generated adequate sequence coverage to produce contiguous assemblies. This is a significant step forward in generating de novo genome assemblies from small amounts of tissues and cells, and will therefore be a useful guide not only for people studying whole organisms but also for people studying variation between cell or tissue types within an individual. I have some minor comments:

      "They were preserved in 96% ethanol, kept at ambient-temperature for one day until they would be stored at -20°C for 1.5 months, until DNA extraction." - Was the preservation at -20°C a deliberate step to see the effect of this treatment on sequencing, or just a conscious choice for specimen preservation?

      The specific conditions used (e.g. the time and speed of centrifugation) for the g-Tube shearing need to be added in the Methods.

      "Circularity was validated manually, and nucleotide bases were called with a 75% threshold Consensus." - Please clarify what the 75% threshold consensus is.

      "We then performed another estimation of the genome size by dividing the number of mapped nucleotides by mode of the coverage distribution" - Why was this done? Did the authors suspect the GenomeScope estimate to be incorrect?

      "We compared our new genomes sequenced to previous Collembola assemblies that were generated with long read and sometimes additional short read data." - This statement needs citations for the previous Collembola assemblies.

      The authors used blastn and megablast to search for the beta-lactam synthesis genes in the new assembly. Tblastx might be more appropriate.

      "For D. tigrina a total of 20,22 Gb HiFi data (Q>=20) was generated," - Do you mean 20.22?

      "For S. aquaticus a total of Gb HiFi data (Q>=20) was generated" - Missing the number before Gb.

      The authors report only one assembly from hifiasm, which I presume is the primary assembly. Given that the authors assembled diploid individuals, I am curious whether hifiasm assembled the alternate haplotype sequences.

      "The insect genomes have higher BUSCO scores (96.5 and 99.6%), but lower contiguity (Table 2, Fig. 3)."

      • This statement is incorrect. A number of insect genomes are more contiguous than the assemblies presented here, including Drosophila melanogaster (PMID: 31653862) and several other Drosophila species, Anopheles stephensi (DOI:10.1101/2020.05.24.113019), Anopheles albimanus (PMID: 32883756)
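      The coverage-based genome-size estimate the reviewer queries (total mapped bases divided by the mode of the coverage distribution) can be sketched as below. The toy coverage track is illustrative, not the authors' data.

```python
# Sketch of a genome-size estimate: total mapped nucleotides divided by the
# mode of the per-base coverage distribution.
from collections import Counter

def genome_size_estimate(per_base_coverage):
    total_mapped_bases = sum(per_base_coverage)
    coverage_mode = Counter(per_base_coverage).most_common(1)[0][0]
    return total_mapped_bases / coverage_mode

# most positions sit at ~30x, so the mode is 30
cov = [30] * 8 + [29, 31, 60, 15]
print(genome_size_estimate(cov))  # -> 12.5
```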
    2. ABSTRACT

      Reviewer 1. Arong Luo

      Reviewer Comments to Author: First, I'd like to commend the authors on attempting to sequence whole genomes of tiny metazoans, which account for a large part of biodiversity in nature and yet are difficult to sequence. Second, I am impressed by their use of ethanol-preserved specimens, which makes genome sequencing more applicable and attractive in practice. We must admit that sometimes we cannot use fresh specimens directly for genome sequencing. Thus, I think this manuscript is really of scientific significance for specific fields such as insects. I found that the focal part of their sequencing protocol is the "whole genome amplification-based Ultra-Low DNA Input Workflow for SMRT Sequencing (PacBio)" throughout the text, which of course is very complex. So, I suggest the authors provide a flowchart showing the critical or main steps of their workflow; readers can then easily understand and refer to their workflow in future projects.

      Finer points:

      Line 35: I suggest providing specific/important information on the 'novel' protocol here.

      Line 119-120: Are the specimens later used for DNA extraction also morphologically identified?

      Line 130-131: Was the DNA extract selected randomly or based on certain measurements?

      Line 393: Delete the dot '.'

    1. Gigabyte doi: 10.46471/gigabyte.6

      Reviewer 2. Yunyun Lv

      Do you understand and agree to our policy of having open and named reviews, and having your review included with the published papers? (If no, please inform the editor that you cannot review this manuscript.) Yes

      Is the language of sufficient quality? Yes

      Are all data available and do they match the descriptions in the paper? No

      Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide No

      Is the data acquisition clear, complete and methodologically sound? Yes

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes

      Is there sufficient data validation and statistical analyses of data quality? Yes

      Is the validation suitable for this type of data? Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data? No

      Any Additional Overall Comments to the Author: This study presents a chromosome-level genome assembly of the common dragonet. The Hi-C method was applied to generate the high-quality genomic assembly. The result is valuable for further genomic analysis. However, some basic questions should be addressed in the article to give clearer insight.

      Line 35, findings section: The annotated total gene number and its quality should be evaluated and presented in the findings section.

      Line 73-75: This sentence contains much speculation. I feel it should be removed, or just mention the sympatry of their living locations.

      Line 220: This section mainly describes the method of gene annotation; however, the corresponding result is absent. These results are important for performing various comparative genomic analyses. Thus, a detailed description of the gene annotation results should be added in the revision.

      Line 238: Availability of supporting data: I searched for the project accession number in the NCBI database but found no result. Thus, the supporting data are currently unavailable.

      Line 33, typo: “syngnatiforms” should be syngnatiformes

      Recommendation Major Revision

    2. Now published

      Reviewer 1. Chao Bian

      Do you understand and agree to our policy of having open and named reviews, and having your review included with the published papers? (If no, please inform the editor that you cannot review this manuscript.)

      Yes

      Is the language of sufficient quality? Yes

      Are all data available and do they match the descriptions in the paper? Yes

      Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide Yes

      Is the data acquisition clear, complete and methodologically sound? Yes

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes

      Is there sufficient data validation and statistical analyses of data quality? Yes

      Is the validation suitable for this type of data? Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Any Additional Overall Comments to the Author?

      This paper, entitled ‘Chromosome-level genome assembly of a benthic associated Syngnathiformes species: the common dragonet, Callionymus lyra’, has provided a reference genome of the common dragonet with high contig and scaffold N50 values. Genome size estimation and gene and repeat annotation were also performed in this study. The analysis approaches, such as genome assembly and annotation, are solid and well performed.

      However, there was no homology-based gene annotation. On the other hand, why did the authors not use HISAT or TopHat to map the RNA reads onto the genome to predict the gene structure? I really rarely see transcriptome annotation done using a Trinity assembly.

      In addition, I still consider that a first published genome should have at least one analysis illuminating the molecular mechanism of a special character of the species. Only an assembly and some genes will largely reduce the impact of, and interest in, this fascinating fish species.

      Some minor mistakes should be corrected: The decimal places throughout the whole paper should be made uniform. Line 41: 538 Mbp should be 538.0 Mbp. Line 45: 27.66% should be 27.7%. Line 76: change “suggest” to “suggests”. Line 83 and line 94: for “see [9]” and “by [10]”, the author's name should be indicated in the text, like “see XX's study [9]”. Line 104: tissue should be tissues. Line 120 and line 131: change ‘562’ to ‘562.0’, and change ‘645’ to ‘645.0’. Line 156: explains should be explain.

      Recommendation

      Major Revision

    1. Now published in GigaScience doi: 10.1093/gigascience/giaa079

      This work has been peer reviewed in GigaScience, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Jing Zhao http://dx.doi.org/10.5524/REVIEW.102331 Reviewer 2: Emre Guney http://dx.doi.org/10.5524/REVIEW.102332

    1. Now published in GigaScience doi: 10.1093/gigascience/giab042

      This work has been peer reviewed in GigaScience, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Karen Ross http://dx.doi.org/10.5524/REVIEW.102747 Reviewer 2: Carlos P. Cantalapiedra http://dx.doi.org/10.5524/REVIEW.102749

    1. Now published in Gigabyte doi: 10.46471/gigabyte.2

      This work has been peer reviewed in GigaByte, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Review 1. Walter Wolfsberger Is the language of sufficient quality? Yes.

      Is the data all available and does it match the descriptions in the paper? Yes.

      Is the data and metadata consistent with relevant minimum information or reporting standards?

      Comment: The accession number for GigaDB provided in the paper does not yield any results in the GigaDB search. Using the species name works though.

      Is the data acquisition clear, complete and methodologically sound?

      Comment: Although it is clear in the paper that a significant portion of data was discarded during the early QC step, there is no indication of the reason for it, or of the nature of the problem that was encountered. In total, the research group produced 396 Gb of raw sequence (211 Gb from short-insert and 185 Gb from long-insert libraries), of which only 180 Gb (130 Gb short-insert and a never-mentioned 55 Gb long-insert) were used later for the assembly. Upon a single-library FastQC analysis, I encountered extreme levels of sequence duplication, which might indicate that the libraries were not diverse or that there was a PCR artifact (like overamplification) that might have led to this low-quality initial data. The parameters for the tool SOAPnuke, used in early QC, are not defined.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Is there sufficient data validation and statistical analyses of data quality? Yes.

      Is the validation suitable for this type of data?

      Comments: The assembly followed a logical order, with appropriate tools used at every step.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      Comment: Although the resulting assembly was of moderate quality (highly fragmented, but with a good BUSCO score), a randomly picked library showed really high duplication rates, which indicates that there might be problems for future data reuse. Addressing these issues, or at least acknowledging them, would benefit the whole report and the dataset.

      Additional Comments:

      I don't think physical coverage is widely used in genome assembly reporting now, as, given the nature of mate-pair reads, it inflates this statistic. I would put the resulting assembly statistics in a table, including all of the metrics (N50, number of contigs, number of scaffolds, average contig length, etc.) and adding the BUSCO score, as the current formatting is not readable.

      Review 2. Nandita Mullapudi Is the language of sufficient quality? Yes.

      Is the data all available and does it match the descriptions in the paper? Yes.

      Is the data and metadata consistent with relevant minimum information or reporting standards?

      Comment: I am unaware of defined reporting standards for assembly reports; however, all sample preparation, data generation and analysis methods have been described in an adequate amount of detail.

      Is the data acquisition clear, complete and methodologically sound? Yes.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Comment: The following additional details would help enable reproduction: (1) Parameters used for data pre-processing with SOAPnuke, as well as the related adapter sequences etc.; these would be necessary to reproduce the data clean-up step. (2) Memory, processor, and run-time details of the computational resources used for assembly. (3) Was the Platanus assembly attempted with different parameters, and how were the parameters reported in the paper arrived at? (4) For gene prediction, several vertebrate sequences were used; the details/source of these reference sequences are missing.

      Is there sufficient data validation and statistical analyses of data quality?

      Comments: (1) One approach to validating an assembly would be to use more than one assembly tool and compare the results (this may or may not be within the scope of this study). (2) With respect to the validation performed by mapping paired-end reads back to the assembly, there is no discussion of the ~14% of paired-end reads that did not map back in the expected orientation. Would tools like REAPR (https://www.sanger.ac.uk/science/tools/reapr) or SEQuel (https://bix.ucsd.edu/SEQuel/man.html) be appropriate to address this, given the high level of heterozygosity in L. d. dumerilii as reported here?

      Is the validation suitable for this type of data? Yes.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      Comments: It may also be helpful to make available the set of cleaned reads, to enable reproduction of the assembly pipeline.

    1. Now published in GigaScience doi: 10.1093/gigascience/giab045 Florian Heyl, Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, 79110 Freiburg, Germany. For correspondence: heylf@informatik.uni-freiburg.de backofen@informatik.uni-freiburg.de

      This work has been peer reviewed in GigaScience, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. (Eric Van Nostrand) http://dx.doi.org/10.5524/REVIEW.102771 Reviewer 2. (Nejc Haberman) http://dx.doi.org/10.5524/REVIEW.102769 Reviewer 3. (William Lai) http://dx.doi.org/10.5524/REVIEW.102770

  7. Jun 2021
  8. gigabytejournal.com
    1. CODECHECK certificate of reproducible computation

      See more on how this works in GigaBlog http://gigasciencejournal.com/blog/codecheck-certificate/

  9. May 2021
    1. Here, we report the chromosome-level genome of the venomous Mediterranean cone snail, Lautoconus ventricosus (Caenogastropoda: Conidae).
    2. comprehensive catalogue of transcripts

      See the previous GigaScience paper looking at high-throughput identification of conotoxins using multi-transcriptome sequencing https://doi.org/10.1186/s13742-016-0122-9

    1. The first version appeared online 12 years ago and has been maintained and further developed ever since, with many new features and improvements added over the years.

      Read more on these developments in the Q&A with the authors http://gigasciencejournal.com/blog/play-it-again-samtools/

  10. Apr 2021
    1. This study describes the serendipitous discovery of Rickettsia amplicons in the Barcode of Life Data System (BOLD), a sequence database specifically designed for the curation of mitochondrial DNA barcodes.

      Find out more in this GigaBlog posting on the project http://gigasciencejournal.com/blog/rickettsia-bacteria-to-rule-them-all/

  11. Mar 2021
  12. Feb 2021
    1. Additional testing of pipeline portability is currently being conducted as a part of the Global Alliance for Genomics and Health (GA4GH) workflow portability challenge

      More on how this went, and an update on where the platform had developed to by Feb 2021, can be viewed in this video from CWLcon2021 https://youtu.be/vV4mmH5eN58

  13. Jan 2021
  14. Dec 2020
    1. The tutorials from the Galaxy Training Network along with the frequent training workshops hosted by the Galaxy community provide a means for users to learn, publish, and teach single-cell RNA-sequencing analysis.

      See the write-up by the Earlham Institute for more on how this training is going

  15. Oct 2020
  16. Aug 2020
    1. Consortia Advancing Standards in Research Administration Information (CASRAI) contributorship taxonomy

      Which has now become NISO CRediT (Contributor Roles Taxonomy), see http://credit.niso.org/

  17. Jul 2020
    1. All supporting data and materials are available in the GigaScience database, GigaDB [48].

      Sequencing data is also in NCBI via BioProject: PRJNA576514 and proteomics data is in PRIDE: PXD018943

  18. Jun 2020
    1. The annual Parasite Awards

      See Parasite Award website here: https://researchparasite.com/

    2. SRA and Gene Expression Omnibus (GEO)

      See the NCBI short read archive (SRA) and Gene Expression Omnibus (GEO) databases here:

      https://www.ncbi.nlm.nih.gov/sra/ https://www.ncbi.nlm.nih.gov/geo/

      Other centralized data repositories are available.

    3. In 2018, eLife published a demonstration
    4. VMs

      We wrote about our first experiences publishing virtual machines back in 2014 http://gigasciencejournal.com/blog/publishing-our-first-virtual-box-of-delights-to-aid-the-fight-against-heart-disease/

    5. Hypothes.is

      GigaScience has hypothes.is integration, and you can read more about how we are using it to add value to papers in this GigaBlog posting http://gigasciencejournal.com/blog/hypothes-is-integration/

    1. teaching hands-on genome assembly courses

      Another example is the Bauhinia Genome project, which has used crowdfunded genomics data to educate the public and to teach genome assembly to MSc students at the Chinese University of Hong Kong http://bauhiniagenome.hk/2018/03/crowdfunded-genomes-and-the-plant-genome-big-bang/

    2. even small research groups

      There have even been community-funded genome projects, such as the "people's parrot" and the Azolla genome project, with strong education components such as this one (although using short reads to make an initial draft genome) http://gigasciencejournal.com/blog/community-genomes-from-the-peoples-parrot-to-crowdfernding/

  19. May 2020
  20. rvhost-alpha.rivervalleytechnologies.com
    1. Members of the Free State Society for the Blind putting the 3D models through a trial run at a recent visit to the Museum
    2. National Museum in Bloemfontein

      See museum website here https://nasmus.co.za/

    1. accumulate potent phytotoxins

      Inspiring the film "The Birds", which you can read more in the Sanger Institute blog https://sangerinstitute.blog/2020/05/04/prising-open-the-scallop-genome/

    2. Katherine James, Natural History Museum, Department of Life Sciences, Cromwell Road, London SW7 5BD, UK; Emma Betteridge, Wellcome Sanger Institute, Cambridge CB10 1SA, UK

      This Q&A features some discussion of her contribution to this project

  21. Apr 2020
    1. Wellcome Sanger 25 Genomes Project

      This project's goal is to sequence 25 novel genomes representing UK biodiversity, as part of the Wellcome Sanger Institute's wider 25th Anniversary celebrations. See https://www.sanger.ac.uk/science/collaboration/25-genomes-25-years

    1. interspecies F1 hybrid of yak (Bos grunniens, NCBI:txid30521) and cattle (Bos taurus, NCBI:txid9913)

      In Tibet this type of yak-cow hybrid is known as a "dzo" (མཛོ) https://en.wikipedia.org/wiki/Dzo

    2. Timothy P L Smith, US Meat Animal Research Center, US Department of Agriculture, State Spur 18D, Clay Center, NE 68933, USA. E-mail: tim.smith2@usda.gov http://orcid.org/0000-0003-1611-6828

      See the Q&A with Benjamin Rosen and Timothy Smith in GigaBlog for more insight http://gigasciencejournal.com/blog/dna-day-2020-cattle-reference-genome/

    1. progression of respiratory diseases

      Including COVID-19, as it is estimated that 50% of patients with COVID-19 who have died had secondary bacterial infections. Watch the COSMIC project, which is looking at metagenomics of respiratory samples to identify the bacterial, fungal, and viral co-infections present in patients with COVID-19 https://www.covid-coinfections.org/t/cosmic-co-infections-and-secondary-microbial-infections-in-covid-19/17

    1. Supplemental File 1: Extended Chinese language (中文版) version of the editorial.

      See also a Chinese language adaptation of this statement in Bull. Ntnl. Nat. Sci Foundation China. http://www.cnki.net/kcms/doi/10.16262/j.cnki.1000-8217.2018.06.001.html

    2. A version of the editorial translated into Chinese is included as a Supplementary File

      See also a Chinese language adaptation of this statement in Bull. Ntnl. Nat. Sci Foundation China. http://www.cnki.net/kcms/doi/10.16262/j.cnki.1000-8217.2018.06.001.html

    3. Here, we help clarify this and also provide a clear statement of our expectations around how authors are assigned to manuscripts submitted to GigaScience.

      A more detailed version of this clarification and background is available via our blog: http://gigasciencejournal.com/blog/appropriate-authorship/

    4. Laurie Goodman

      ‡ Senior author

  22. Mar 2020
    1. which is the basis of our planned second release (PLINK 2.0).

      See the homepage for updates taking it towards the PLINK 2.0 alpha https://www.cog-genomics.org/plink/2.0/

      We also have phased and annotated data for use in PLINK 2.0 worked examples in GigaDB http://dx.doi.org/10.5524/100516

    1. 3.Li Z, Barker MS. Inferring putative ancient whole genome duplications in the 1000 Plants (1KP) initiative: Access to gene family phylogenies and age distributions. bioRxiv. 2019:735076. https://www.biorxiv.org/content/10.1101/735076v1.

      A peer reviewed and updated version of this has now been published in GigaScience https://doi.org/10.1093/gigascience/giaa004

    2. chlorophyte green algae

      Some of the authors have recorded a podcast discussing the implications for algae research from this data https://podcasts.apple.com/us/podcast/gane-ka-shu-wong-michael-melkonian-on-if-algae-can/id1420197433?i=1000458893924

    1. hypothes.is (use the hashtag/tag #chromosomenomenclature)

      Please add comments directly on the key parts of the commentary you would like to raise any issues with.

    1. Table S3. Representative applications of genome editing. A summary of the representative applications in different organisms.

      Using hypothes.is this information can also be updated via annotations here. e.g. adding mention of Twist biosciences, whose Oligo pools are utilized in many CRISPR applications including generation of CRISPR guide RNA (sgRNA) libraries. See https://www.twistbioscience.com/products/oligopools

    2. Table S1. Online tools for TALEN and CRISPR/Cas9. Collected online tools for TALEN and CRISPR/Cas9 are presented in this table. Updates can be accessed in GitHub [107]. Table S2. Commercial services for TALEN and CRISPR/Cas9. Collected commercial services for TALEN and CRISPR/Cas9 are presented in this table. Updates can be accessed in GitHub [107]. Table S3. Representative applications of genome editing. A summary of the representative applications in different organisms.

      Given that new methods, kits, and services continue to be rapidly developed and updated, we set up an editable version on the GitHub wiki, and readers are encouraged to update it. See https://github.com/gigascience/paper-chen2014/wiki

    1. This must be achieved by sequencing and archiving huge numbers of microbial genomes, both from clinical cases and known environmental reservoirs, on a continual basis.

      Even without reference genomes, mining metagenomes for coronavirus sequences has become particularly topical in 2020. See the Pangolin 2019-nCoV-like coronavirus example https://doi.org/10.1101/2020.02.08.939660

    2. swine flu

      Jennifer Gardy discusses the groundbreaking H1N1 crowdsourcing efforts in her TEDx talk here (with lots of lessons for the coronavirus outbreak a decade later) https://www.youtube.com/watch?v=LmAugMSJ1-Y

    3. Escherichia coli O104: H4

      See more in GigaBlog about the novel "tweenome" method of datasharing for this project http://gigasciencejournal.com/blog/notes-from-an-e-coli-tweenome-lessons-learned-from-our-first-data-doi/

    1. MERS coronavirus

      Mining metagenomes for coronavirus sequences has become particularly topical in 2020 (see the Pangolin 2019-nCoV-like coronavirus example https://doi.org/10.1101/2020.02.08.939660)

    2. RNA viruses

      As this works with RNA viruses it has been made part of the "Free access to OUP resources on coronavirus and related topics" collection on the Oxford University Press website https://academic.oup.com/journals/pages/coronavirus

    1. direct RNA sequencing. Despite the scientific relevance of VACV, no LRS data have been generated for the viral transcriptome to date.

      This approach of using Oxford Nanopore direct-RNA sequencing for viruses has now been carried out on the SARSCov2/COVID19 causing coronavirus. See https://doi.org/10.1101/2020.03.05.976167

  23. Feb 2020
  24. Oct 2019
    1. African eggplant

      Also known as the scarlet eggplant or bitter tomato.

    2. “orphan crop”

      The African eggplant is a good example of the work of the Africa Orphan Crop consortium and many of the authors are consortium members. You can read more on the first genomes released in GigaBlog here: http://gigasciencejournal.com/blog/democratising-data-aocc/