On 2019-07-18 20:58:17, user Morten S. Dueholm wrote:
Dear Robin,
We very much appreciate that you have taken the time to review our manuscript. We are glad that you see the potential of the approach, and we value your comments and suggestions, some of which we will certainly implement in our manuscript before submission. Our responses to the points raised in your review are given below in italic. If you have further comments or questions, you are very welcome to get back to us.
Best regards,
Morten and Per
AutoTax BioRxiv Review, Jul 03, 2019, Robin Rohwer
This manuscript describes a new method, AutoTax, to create databases out of full-length 16S rRNA gene sequences that can be used for taxonomy assignment of 16S rRNA gene amplicon data. This is a valuable contribution, because availability of full-length 16S rRNA gene sequences is increasing, and ecosystem-specific databases improve classification of amplicon data. Combining high-throughput generation of full-length 16S rRNA gene references with high-throughput database creation would be a major advancement. This manuscript also describes the application of AutoTax databases to the design of FISH primers, but I will focus my review on taxonomy assignment because that is my area of expertise.
I believe this tool will be a valuable resource, which is why I have taken the time to review it carefully. I hope these comments will be addressed in the published version of this paper. I have three main concerns with this manuscript.
First, it needs more detailed descriptions of the underlying methods that AutoTax uses for alignment and identity calculations. The manuscript refers readers to a github repo and supplemental documents for details, but this manuscript is introducing a new method so the algorithms/tools it employs should be clear in the main text.
We have elaborated on the methods in the main manuscript; however, we keep the detailed information on settings etc. in the supplementary material, as we believe that this information is only relevant for a small subset of the target audience and would remove focus from the practical aspects.
Second, the taxon levels in the AutoTax database are assigned based on identity thresholds. The authors acknowledge briefly that this means their AutoTax databases lack phylogenetic information, but this is a major limitation so they should justify in more detail why they chose this method and why the resulting taxon names are still meaningful. They also cite a 2018 paper by Edgar (https://peerj.com/articles/... several times to support other claims, but do not mention that the main finding of this paper is that percent identity is a poor predictor of taxon level.
Many of our sequences have close relatives in the SILVA database and obtain their taxonomy directly from this reference database. These taxa are therefore supported by phylogenetic information. The denovo taxa are constructed based on sequence identity alone. This is a simple solution which, in contrast to phylogenetic approaches, can be reproduced. Phylogenetics is especially problematic for the large datasets that will become standard in the future, as heuristic approaches are required to process the data.
The conclusion that "Percent identity is a poor predictor of taxon levels" in the 2018 paper by Edgar relates to V4 amplicons, which are known to carry sparse phylogenetic information. That is why, in our studies, we only cluster full-length 16S rRNA sequences, for which taxon thresholds are statistically supported.
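As a minimal sketch of what identity-based rank assignment can look like, the Python below maps a percent identity to the lowest rank at which a sequence would share a name with its closest reference. The threshold values are the full-length 16S rRNA thresholds proposed by Yarza et al. (2014) and are used here as an assumption for illustration. The function and variable names are hypothetical, and this is not the AutoTax code.

```python
# Illustrative only: thresholds follow Yarza et al. (2014) for full-length
# 16S rRNA genes. All names are hypothetical and not taken from AutoTax.
RANK_THRESHOLDS = [
    ("species", 98.7),
    ("genus", 94.5),
    ("family", 86.5),
    ("order", 82.0),
    ("class", 78.5),
    ("phylum", 75.0),
]


def lowest_shared_rank(percent_identity):
    """Return the lowest rank whose threshold the identity meets, or None."""
    for rank, threshold in RANK_THRESHOLDS:
        if percent_identity >= threshold:
            return rank
    return None


def ranks_needing_denovo_names(percent_identity):
    """List the ranks below the lowest shared rank (these need placeholder names)."""
    order = [rank for rank, _ in RANK_THRESHOLDS]
    shared = lowest_shared_rank(percent_identity)
    if shared is None:
        return order
    return order[: order.index(shared)]
```

For example, a sequence that is 95.0% identical to its closest reference would share the reference's genus under these assumed thresholds, and only the species rank would need a placeholder name.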
Third, the importance of AutoTax would be more clear if the manuscript discussed how it fits in with similar work on improving taxonomy classifications. Many previous studies (examples below) have also found improvements in taxonomy assignment when databases are improved, yet here this is presented as a novel finding. Many other methods also exist for creating custom databases, yet they are also not discussed. I believe AutoTax is a novel contribution, but its value will only be clear and meaningful in the context of previous work.
We do not consider “better databases provide better classifications” to be a novelty. This is already clear from our previous versions of the MiDAS database (and other databases), which is a manually curated version of the SILVA database. However, we will include some of the references in our manuscript as support for the need for improved taxonomy.
Specific Comments:
line 68- Great explanation of why taxonomy is needed for cross-study comparisons, there is a common confusion that ASV's have solved the problem.
line 102- Since you are introducing an alternative tool to build databases, you should include more detail on the existing ways that databases are created and the comparative improvements and drawbacks when using AutoTax. (Perhaps in discussion instead of right here in introduction.) For example, how does AutoTax compare to the results from SINA (https://academic.oup.com/bi... and RAxML (https://academic.oup.com/bi... or FastTree (https://journals.plos.org/p..., or to manual curation using Arb (https://academic.oup.com/na...
The references above do not link to tools that can be used to build databases; instead, they are tools that can be used to align/classify sequences (SINA) or infer the phylogenetic relationship of sequences (RAxML, FastTree, arb). We have previously used SINA for sequence alignment and calculation of percent identity; however, the algorithms in SINA are not suited for this purpose (see discussion with the author: https://github.com/epruesse...), which is why we decided to use usearch.
line 109- Instead of (or in addition to) citing the original field guide (your ref #22), please cite TaxAss as the reference for the FreshTrain. The original Newton citation is appropriate when referencing the phylogeny, vocabulary, or ecology of the included Freshwater bacteria, but the TaxAss citation is appropriate when referencing the current updated version of the database that can be used for taxonomy assignment. (https://msphere.asm.org/con...
We will add this reference along with the one for the original FreshTrain database.
line 123, line 384- As you re-make your custom AutoTax taxonomy with a new version of SILVA, how does AutoTax prevent changes to the ecosystem-specific names in the first AutoTax version? It is clear in line 384 that you can add more custom sequences without messing up the existing taxonomy, but what if some of your sequences are added to SILVA itself? Couldn't that then shift all of your ESV centroids? And if you avoid that shift by leaving the duplicated sequence in the custom set, wouldn't AutoTax then ignore any added phylogenetic information that came from the additional SILVA curation, since preference is given to ESV centroids in the case of conflicts?
Firstly, we do not include any of the sequences from SILVA in our ecosystem-specific database. When our own full-length sequences are added to SILVA, this will therefore not have any influence on our ESV numbering or on the identification of ESV centroids. However, when the sequences are added, we will get better support for the taxonomy assignment by SILVA, which will improve classification. So with time, as our and others' high-quality full-length sequences are added to SILVA, this will help to improve the taxonomic classification provided by SILVA and thus also improve our ecosystem-specific database.
line 129- I found mixing the abbreviations ASV and ESV very confusing. For example, ASV's could also be considered "exact" sequence variants, since they are unclustered, and to add to the confusion you DO cluster the ESVs later in the method. Choosing different terms would help clarify when you are talking about full-length vs. amplicon sequences.
We understand that it may not be clear from the abbreviations that ESVs refer to full-length 16S rRNA sequences, whereas ASVs refer to shorter amplicons. To avoid confusion we will use the term full-length ESVs in the manuscript.
line 200- This "in brief" description of methods is inadequate, especially for a paper that is introducing the method for the first time. What algorithm does the script use to identify an ESV's closest relative- RDP classifier, SINTAX, BLAST? What algorithm does the script use to obtain taxonomy? What algorithm is used to calculate sequence identity? These choices have major impacts on your results, and for a paper that is introducing a method the workings of the method should not be hidden in supplemental materials or within code.
We use a comprehensive usearch search (-maxrejects 0) to identify the closest relatives in the SILVA database and to calculate the percent identity. We do not use any classifiers. We will make this clearer in the next version of the manuscript.
line 332- I would believe the results of a kmer-based method like the standard RDP/Wang classifier more than sequence identity. You are using the full SILVA database in your classification, so overclassification should not be a major problem. If you are worried overclassifications might be masking the gains of your method, you can double check by looking at how many classifications change when you use the new database. The main point of your reference #10 is that sequence identity is a poor predictor of taxonomic rank, and Edgar even states in the abstract that "95% identity was found to be a twilight zone where taxonomy is highly ambiguous because the probabilities that the lowest shared rank between pairs of sequences is genus, family, order or class are approximately equal."
Sequence identity is actually a fairly good predictor of taxonomic ranks when full-length 16S rRNA sequences are analyzed (Yarza et al. 2014, Kim et al. 2014). The statement that "95% identity was found to be a twilight zone where taxonomy is highly ambiguous because the probabilities that the lowest shared rank between pairs of sequences is genus, family, order or class are approximately equal." relates to V4 amplicons and not full-length sequences. Another key point is that the NCBI was used as the ground truth, and we know that this database contains many errors according to both conserved marker gene phylogenetics and ANI (Parks et al. 2018, Ciufo et al. 2018).
Yarza P, Yilmaz P, Pruesse E, Glöckner FO, Ludwig W, Schleifer K-H, et al. (2014). Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat Rev Microbiol 12: 635–645.
Kim M, Oh H-SS, Park S-CC, Chun J. (2014). Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. Int J Syst Evol Microbiol 64: 346–351.
Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, et al. (2018). A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol 36: 996–1004.
Ciufo S, Kannan S, Sharma S, Badretdin A, Clark K, Turner S, et al. (2018). Using average nucleotide identity to improve taxonomic assignments in prokaryotic genomes at the NCBI. Int J Syst Evol Microbiol 68: 2386–2392.
line 367- What method specifically is used to "map" the ESV's?
We used usearch11 -usearch_global -maxrejects 0 -id 0. This information is provided in the materials and methods.
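A search of this kind could be run and parsed roughly as sketched below. The flags -usearch_global, -maxrejects 0 and -id 0 are the ones quoted above. The file names, the -strand and -blast6out options, and the helper function are assumptions added for illustration and are not taken from the AutoTax scripts.

```python
import csv
import io

# Hypothetical invocation: file names, -strand and -blast6out are assumptions.
# -blast6out writes one tab-separated line per hit, starting with
# query label, target label and percent identity.
USEARCH_CMD = [
    "usearch11", "-usearch_global", "query.fa",
    "-db", "reference.fa",
    "-maxrejects", "0", "-id", "0",
    "-strand", "plus",
    "-blast6out", "hits.b6",
]


def best_hits(blast6_text):
    """Return {query: (target, percent_identity)}, keeping the highest-identity hit."""
    best = {}
    for row in csv.reader(io.StringIO(blast6_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query, target, pident = row[0], row[1], float(row[2])
        if query not in best or pident > best[query][1]:
            best[query] = (target, pident)
    return best
```

The command list could be passed to `subprocess.run(USEARCH_CMD, check=True)` on a machine with usearch installed, and the contents of `hits.b6` then handed to `best_hits`.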
line 378- How do you know these classifications are correct when there is nothing to compare them to and no way to test for accuracy? Perhaps you could change it to "This approach will distinguish species-level classifications."
Excellent point. This part does not include any taxonomy. We will rewrite or remove this sentence.
line 387- When you defer to a centroid name over Silva, what happens to the Silva name? Is it lost in favor of the placeholder name? Could this result in a situation where known organisms are missed because they get classified as a new placeholder name instead?
If the centroid of, e.g., a species falls outside the criteria for a SILVA genus, all sequences within that species will obtain a denovo name, even though some of the sequences may fall within the threshold. However, there will most likely also be other species whose centroids fall within that genus, in which case the genus name will remain. AutoTax creates a log of all these changes, which means that you can search for missing taxa if relevant.
line 395- This "striking" finding is not novel. It is well supported in the literature that custom databases will dramatically improve classification and you should include that context when you describe your result. See for example:
https://msphere.asm.org/con..., Fig 3
http://journals.plos.org/pl..., Fig 5
https://doi.org/10.1186/147..., Table 1
http://dx.plos.org/10.1371/..., Table 4
https://peerj.com/articles/494, Fig 3
https://academic.oup.com/da..., Table 1
http://www.sciencedirect.co..., Fig 6
http://www.biomedcentral.co..., Fig 4
https://peerj.com/articles/..., Fig 1
Thanks for providing these references, they support why the method is relevant and will be included in the article.
line 410- The improvement over MiDAS is one of your most compelling findings. You should elaborate on it more! For example, a little background on how MiDAS was created for your non-wastewater audience, and then you can emphasize how the volume of sequences is really important. This is the most compelling evidence for scientists to adopt your method when they already have small custom databases, and it is also the most appropriate test of improvement since MiDAS is the current standard for activated sludge community classifications.
We will elaborate on this in the manuscript. The MiDAS database is actually a curated version of SILVA, which includes all SILVA sequences, so it is not a "small" custom database.
line 417- This analysis is certainly useful to the wastewater research community, but it tests the resolution of primer regions, not the validity of database. To test how your database performs, you should use known amplicon sequences that do not already have an exact match in the database. For example, by creating amplicons from unincluded ESV's.
We disagree with this. We show in the paper that we can make a database that includes near-perfect references for almost all bacteria in an ecosystem. Therefore, it makes sense to test the resolution of the primer regions when there is an exact match in the database.
line 490- How can you state that the AutoTax databases are near-complete when you haven't performed any completeness estimates? They are certainly "more complete"...
We actually determine the completeness of the database in Figure 1. The results are based on mapping amplicon data to the ESV database and calculating how many of the amplicons have high-identity references in the database. The analysis is of course biased by the amplicon primers, but it is the best we can do.
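The calculation described here can be sketched as the fraction of amplicons whose best database hit meets an identity cutoff. The 99.0% cutoff, the function name and the input format below are assumptions for illustration, not the values or code behind Figure 1.

```python
def fraction_covered(best_identities, min_identity=99.0):
    """Fraction of amplicons whose best reference hit is at least min_identity.

    best_identities: one value per amplicon, the percent identity of its best
    database hit, or None if the amplicon had no hit at all.
    The default cutoff is an assumed value for illustration.
    """
    if not best_identities:
        return 0.0
    covered = sum(
        1 for identity in best_identities
        if identity is not None and identity >= min_identity
    )
    return covered / len(best_identities)
```

As the reply notes, an estimate of this kind only reflects the organisms that the amplicon primers can detect.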
line 491- These public databases are not "much larger" than your database because your AutoTax database combines your new sequences with Silva, and this combination is therefore slightly bigger than Silva. This statement could be misleading to a less careful reader, because some custom databases are used alone, without being combined with Silva/Greengenes.
We do not include any sequences from SILVA in our database. SILVA is only used for classification. The statement is therefore correct. We will make this clearer in the manuscript.
line 496- How can you claim sub-species level classifications? You have explicitly stated that AutoTax uses a 7-level taxonomy.
We are able to resolve multiple ASVs for each species. Therefore, we claim that the microbial community can be analyzed at the sub-species level.
line 506- I like seeing a time estimate, but it is meaningless without some broad description of the computational platform used. "A few hours," is great, but was that on a standard laptop or a high throughput computing center?
This is important and relevant information and we will add it to the manuscript.
line 513- "Although the sequence similarity clustering does not necessarily reflect the true evolutionary history of the microbes or the phenotypic characteristics..." This is the biggest weakness of your method, and a major concern. It deserves more in-depth discussion. For example, you show improvement over a smaller custom database (MiDAS), but you define improvement based on how many ASVs were named. Is it really an improvement to end up with more names if those names are less meaningful? How valuable is a placeholder name when it lacks phylogenetic context? Also, you need to discuss the limits of sequence identity for defining taxonomic rank. You cite a paper by Edgar (ref # 10) multiple times, yet do not discuss its main finding that sequence identity is a poor predictor of taxonomic rank.
Many of the points here have been discussed above. An important point is that AutoTax uses the most recent phylogenetic information when possible (the SILVA taxonomy). The placeholder names serve as robust reference points until true taxonomies are made. They have the same diversity as true taxa and are therefore good substitutes until the taxonomy has been curated by phylogenetic experts or by genome-based methods, such as those used in the Genome Taxonomy Database (GTDB) and those being used to curate the NCBI taxonomy. The denovo names will be replaced by true taxonomic names as the databases are curated. We can easily keep track of these changes when AutoTax-based databases are updated.
line 530- "...will provide a common language for scientific communities..." How will AutoTax accommodate existing and future manually curated taxonomies that include phylogenetic information? How do you prevent dueling frameworks, the "more complete" vs. the "more correct." Can AutoTax be incorporated into existing manual curation efforts, or is it purely a separate approach?
See comment above.