I think certain infrastructure improvements could make this pipeline more user-friendly and stable, and bring it closer to software engineering best practices (e.g., automated tests and better versioning of the individual software packages that are included). These are a few problems and potential solutions I see:
1) This pipeline requires managing very large conda environments, which can quickly become unwieldy, on top of potential difficulties with installation and with solving the environments at all. If the authors would like to stay with conda environments, a quick improvement to both environment solving and installation speed would be to build these environments with mamba (see the sketch after this point).
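As a minimal sketch of what that would look like in practice (the environment file and environment name below are placeholders, assuming the repository ships conda YAML specifications):

```bash
# One-time setup: install mamba into the base environment from conda-forge.
conda install -n base -c conda-forge mamba

# Create each pipeline environment from its YAML spec; mamba's dependency
# solver is typically much faster than conda's classic solver.
mamba env create -f environment.yml   # placeholder file name
conda activate <env-name>             # placeholder environment name
```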
2) Since the pipeline is written as a series of bash/R/Python scripts that depend on conda environments, it is somewhat fragile, and it is hard to ensure it works on most infrastructures, or even on the intended one. Even if the installation process is made smoother, there is still the problem of verifying which versions of the tools were actually used in a given run. Conda environments and their versions can be exported, but that is not a complete solution. I think an involved pipeline like this would greatly benefit from being executed with a workflow manager such as Snakemake or Nextflow, with my personal preference being Nextflow. Although Snakemake is easier to learn and integrates conda environments more directly, it is harder to guarantee that such pipelines will run on diverse platforms. Nextflow can also use conda environments, but the preferred approach is Docker or Singularity images, which solves much of the version-tracking problem. Additionally, Nextflow has testing and CI capability built in, so it is easier to ensure that future updates still function and work as expected. Finally, Nextflow has been tested on a wide range of platforms, from HPC schedulers and local environments to cloud providers. A rough illustration of what a containerized, version-pinned run looks like is sketched below.
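For illustration only (the repository name, release tag, profile, and `--input` parameter are hypothetical placeholders, not part of the authors' pipeline), a containerized Nextflow pipeline can be invoked so that both the pipeline version and the tool versions are pinned and recorded:

```bash
# -r pins an exact pipeline release; -profile docker (or singularity) pulls
# fixed container images; -with-report and -with-trace record what actually ran.
nextflow run <org>/<pipeline> \
    -r v1.0.0 \
    -profile docker \
    --input samplesheet.csv \
    -with-report report.html \
    -with-trace trace.txt
```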
3) Related to the issue above, I don't see how this pipeline can be run in a high-throughput way, because it is not expressed as a DAG of tasks the way Snakemake/Nextflow pipelines are. My understanding is that all samples would have to be processed together in more of a "for loop" fashion, so the pipeline cannot take advantage of the HPC or cloud resources one might have. The only way somebody could use this in the cloud is on a single EC2 instance, which is neither cost- nor time-efficient. Making the pipeline truly high-throughput, so that samples can be run in parallel for certain tasks and then aggregated together, requires DAG infrastructure; the contrast is sketched below.
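To make the contrast concrete, here is a rough sketch (the script names, file layout, and SLURM profile are hypothetical and only stand in for the general pattern):

```bash
# Current-style serial execution: one sample at a time on one machine.
for sample in samples/*.fastq.gz; do
    ./align.sh "$sample" && ./call_variants.sh "$sample"
done

# With a DAG-based workflow manager, each per-sample task becomes an
# independent job that the executor can dispatch concurrently, e.g. to SLURM
# (assuming a "slurm" profile is defined in the pipeline's nextflow.config):
#   nextflow run <org>/<pipeline> --input samplesheet.csv -profile slurm
# Aggregation steps then run automatically once their upstream tasks finish.
```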