  1. Jul 2025
    1. scaffold information generated by Bambus 2 allows us to integrate multiple sources of information and obtain more accurate annotations of the resulting assembly
    2. provide additional functionality made possible by the integration of different analyses

      Need to understand details of this: What specific integration does MetAMOS really do?

    3. INSTALL script. This will automatically configure the pipeline to run within the user's environment and also fetch all required data

      data => databases?

    1. A multi-centre study evaluating the use of nanopore 16S for clinical microbial detection with shared mock samples (looking for consistency, LODs, etc.?)

      This study applies nanopore sequencing to 16S. It compares two bioinformatic pipelines, one of which uses Emu.

      Todd: Emu holding its own against a commercial tool, fewer species classified (likely DB issue) but better precision wrt discriminating species

      • Only shortcoming is that the Emu pipeline (GMS-16S) classified fewer species

        • Todd says this is likely a database issue.

        • Can be fixed when implementing #SOMAteM?

        • Check methods for details on the Emu pipeline: “Bioinformatic data analysis and identification of pathogen…”

      Evaluation of two bioinformatic pipelines: 1928-16S and GMS-16S
      The performance of two separate bioinformatic pipelines was compared: the commercial 16S pipeline developed by 1928 Diagnostics (1928-16S) and the gms_16S bioinformatics analysis pipeline that uses the EMU classification tool (GMS-16S). Overall, 1928-16S identified a higher number of species in comparison to GMS-16S (Supplementary FigS2, Supplementary file 2 and 3). However, significant differences were observed at species level, particularly for Streptococcus and Staphylococcus. GMS-16S demonstrated high accuracy of species level classification, effectively discriminating S. intermedius from S. anginosus in sample G4, as well as separating S. aureus from Staphylococcus argenteus in sample Q3 (Fig. 3a). GMS-16S also more accurately classified members of the Enterobacteriaceae family (Q7, Q5), and was able to identify Serratia marcescens at species level with greater precision in sample Q1 compared to 1928-16S. Conversely, 1928-16S classified a larger proportion of reads as C. acnes in sample G6 (laboratory k), whereas GMS-16S distributed the reads between C. acnes and the closely related C. namnetense.

      (annotations in Public group)

    2. commercial 16S bioinformatic pipeline from 1928 Diagnostics (1928-16S) was evaluated and compared with the open-sourced gms_16S pipeline that is based on the EMU classification tool (GMS-16S).

      Emu is more accurate; Todd is happy :)

      • more annotations in Public group
    1. RapidONT, a workflow designed for cost-effective and accessible WGS-based pathogen analysis

      Includes both a lab protocol and bioinformatic pipeline

    1. Assembly graphs produced by different tools from the same data may differ significantly, posing a challenge to tools for downstream processing tasks

      This could be a useful tool to integrate post-assembly if it improves compatibility with downstream tools such as plasmid binning in #SOMAteM


      How can the LLM help solve this by suggesting the correct downstream tool or by converting outputs to be compatible? (Not relevant here, since this paper solves the issue.)

    1. choice of the right algorithm for a given dataset has become difficult due to numerous comparative reports on these different assemblers [88, 89]

      What does the choice of algorithm depend on?

    2. major advantage of De Bruijn graphs is that assembled reads contain fewer errors and errors can be easily corrected prior to assembly
    1. Refer to the original/live annotation in Zotero/note

      This tool does something very similar to omi and has a lot of desirable qualities + evaluation methods we can learn from. #omi-relevance

      What it can do

      SpatialAgent employs adaptive reasoning and dynamic tool integration, allowing it to adjust to new datasets, tissue types, and biological questions. It processes multimodal inputs, incorporates external databases, and supports human-in-the-loop interactions, enabling both fully automated and collaborative discovery

      tasks such as gene panel design, cell and tissue annotation, and pattern inference in cell-cell communication and pathway analysis

  2. amos.sourceforge.net
    1. small, circular nature of the mitochondrial genome allows reads to span the start and end positions, leading to incomplete exclusion of mtDNA
    1. MADRe, a modular and scalable pipeline for long-read strain-level metagenomic classification, enhanced with Metagenome Assembly-Driven Database Reduction.
    2. contig-to-reference mapping reassignment based on an expectation-maximization algorithm for database reduction,

      EM method similar to EMU?

    3. mapping-based tools such as MetaMaps [24], PathoScope2 [25], EMU [26] and MORA [27], which rely on read alignments and reassignment algorithms, offer higher precision at a greater computational cost.
    4. range of metagenomic classification tools have been developed, which can be broadly categorized into marker-based, DNA-to-protein and DNA-to-DNA approaches, as described in [4].
    5. K-mer-based tools such as Kraken2 [14], KrakenUniq [15], Bracken [16], Centrifuge [17], CLARK/CLARKS [18, 19], Ganon [20, 21], Taxor [22], and Sylph [23] are known for their speed and scalability to large databases, but often trade precision for speed

      This whole paragraph has good knowledge that could be incorporated into LLM-RAG; we could ask the user about their need for speed vs. accuracy.

    6. MADRe achieves high precision and strain-level resolution while maintaining lower memory usage and runtime compared to existing tools
    1. assembly tools remain prone to large-scale errors caused by repeats in the genome, leading to inaccurate detection of AMR gene content
    2. the fact that multiple consecutive genes lie within a single read to construct gene-space de Bruijn graphs where the k-mer alphabet is the set of genes in the pan-genome of the species under study
    3. reads corresponding to different copies of AMR genes can be effectively separated based on the genomic context of the AMR genes, and used to infer the nucleotide sequence of each copy
    1. We present Autocycler, a command-line tool for generating accurate bacterial genome assemblies by combining multiple alternative long-read assemblies of the same genome
    2. Autocycler builds a compacted De Bruijn graph from the input assemblies, clusters and filters contigs, trims overlaps and resolves consensus sequences by selecting the most common variant at each locus
    1. To migrate this code to DSL2, you need to move all of your channel logic throughout the script into a workflow definition

      seqscreen was written in DSL1 and needs to be migrated (Todd); a sketch of the change is below
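      A minimal before/after sketch of the migration, assuming a hypothetical single-process pipeline (process, channel, and parameter names are illustrative, not from seqscreen):

      ```nextflow
      nextflow.enable.dsl = 2

      process FASTQC {
          input:
          path reads

          output:
          path 'fastqc_out'

          script:
          """
          mkdir fastqc_out
          fastqc -o fastqc_out ${reads}
          """
      }

      // In DSL1, channel logic sat at the top level of the script and processes
      // were wired implicitly through named channels. In DSL2, that logic moves
      // into an explicit workflow definition and processes are called like functions.
      workflow {
          reads_ch = Channel.fromPath(params.reads)
          FASTQC(reads_ch)
      }
      ```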

    1. driving the development of community-centric tools on Seqera.io, empowering scientists worldwide to leverage modern software capabilities on demand
    1. Programmed with a deep understanding of Nextflow, common bioinformatics tools, and the overarching scientific community.

      by "overarchinve scientific community" do you mean some discussions on nf-core forums?

    2. has deep knowledge of the errors

      What could be the source of this knowledge? - Maybe human-in-the-loop training with automated code gen + linter use? - Grazing on forums?

      able to identify the root cause of errors, help troubleshoot, and suggest edits

    3. not only give you the initial conversion, but also run the stages of the code that it generates with sample data and iteratively correct any code that yields runtime errors
    4. convert a pipeline from Bash/CWL/WDL to Nextflow

      use cases

      can not only give you the initial conversion, but also run the stages of the code that it generates with sample data and iteratively correct any code that yields runtime errors

    5. Seqera AI – a bioinformatics agent purpose-built for the scientific lifecycle

      Seqera AI can:

      • Suggest pipelines (tested and validated)
      • Answer bioinformatics questions with context
      • Generate Nextflow code + validate/self-correct (when would someone use this?)

      Context retrieved:

      • context for writing and testing Nextflow code
      • context of pipeline results to aid interpretation

      source: Summarized from text below

    1. importance of improved educational programs aimed at biologists and life scientists that emphasize best practices in data engineering
    2. increased theoretical and empirical research on data provenance, error propagation, and on understanding the impact of errors on analytic pipelines
    3. we focus specifically on concerns that lie at the interface of biological data and computational inference with the goal of inspiring increased research and educational activities in this space
    1. how to best benefit from recent advances in AI and how to generate, format and disseminate data to enable future breakthroughs in AI-guided drug discovery
    1. When given well-crafted instructions, these chatbots hold the potential to significantly augment bioinformatics education and research
    2. role prompting that assigns a role to the chatbot, few-shot prompting that provides relevant examples, and chatbot self-reflection that improves responses based on task feedbacks
    1. In addition, varying study designs will require project-specific statistical analyses.

      how is this addressed? - helpful for #SOMAteM

    2. use of isolated Conda environments for Hecatomb minimizes package version conflicts, minimizes overhead when rebuilding environments for updated dependencies, and allows maintenance and customization of different Hecatomb versions.
    3. While Hecatomb is a Snakemake pipeline, it uses the Snaketool command line interface to make running the pipeline as simple as possible [95]. Snaketool populates required file paths and configuration files, allowing Hecatomb to be configured and run with a simple command
    1. An opt-in feature for now, strict syntax enables consistent behavior between the Nextflow CLI and language server, and enables numerous new features
    1. This new specification enables more specific error reporting, ensures more consistent code, and will allow the Nextflow language to evolve independently of Groovy.
    2. strict syntax will eventually become the only way to write Nextflow code, and new language features will be implemented only in the strict syntax
    1. omi feature idea: minor CLI tools - not pipelines

      • Thought process: What does this tool need as input? An MSA.

      • Can this CLI tool make the MSA as well if the user tells it stuff? That’s too specialized -- would be nice to make an LLM tool like omi for that though

      • I think omi can beat Seqera AI and ChatGPT in this space where we identify and wrap essential CLI tools to be run by text prompts

      • Leave the Nextflow part to Seqera AI, if it’s good enough for running pipelines

    1. found that multi-scale containerization, which makes it possible to bundle entire pipelines, subcomponents and individual tools into their own containers, is essential for numerical stability
    2. The dataflow model is superior to alternative solutions based on a Make-like approach, such as Snakemake [16], in which computation involves the pre-estimation of all computational dependencies, starting from the expected results up until the input raw data
    3. Although the graphical user interface (GUI) in Galaxy offers powerful support for de novo pipeline implementation by non-specialists, it also imposes a heavy development burden because any existing and validated third-party pipeline must be re-implemented and re-parameterized using the GUI.
    1. Configuration parameters are loaded one after another and overwrite previous values. Hardcoded pipeline defaults are first, then the user’s home directory, then the work directory, then every -c file in the order supplied, and finally command line --<parameter> options.
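      A small illustration of that ordering with a made-up parameter; each later source overwrites the earlier one, so the command-line flag wins:

      ```nextflow
      // 1. Hardcoded default in the pipeline's nextflow.config
      params.genome = 'GRCh37'
      // 2. Overridden by ~/.nextflow/config in the user's home directory
      params.genome = 'GRCh38'
      // 3. Overridden by a -c custom.config supplied at launch
      params.genome = 'hg19'
      // 4. Finally `nextflow run main.nf --genome hg38` overrides them all,
      //    so params.genome == 'hg38' inside the pipeline
      ```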
    1. If you wish to repeatedly use the same parameters for multiple runs, rather than specifying each flag in the command, you can specify these in a params file. Pipeline settings can be provided in a yaml or json file via -params-file <file>.
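      For example (parameter names are illustrative, assuming a hypothetical pipeline):

      ```yaml
      # params.yaml
      input: 'samples.csv'
      outdir: './results'
      max_cpus: 16
      ```

      Then run with `nextflow run main.nf -params-file params.yaml`.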
    2. Differential abundance analyses of relative abundances from microbial community data are plagued by multiple issues that aren’t fully solved yet, but some approaches seem promising
    1. Improvements to NanoPlot and NanoComp are, among code optimizations, the generation of additional plots, using dynamic HTML plots from the Plotly library, and enabling further exploration by the end users
    2. Chopper is a tool that combines the utility of NanoFilt and NanoLyse, for filtering sequencing reads based on quality, length, and contaminating sequences, delivers a 7-fold speed up compared to the Python implementation, making use of the Rust-Bio library
    1. For Nextflow DSL2 nf-core pipelines - parameters defined in the parameter block in custom.config files WILL NOT override defaults in nextflow.config! Please use -params-file in yaml or json format in these cases:
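      In other words, a params block like this in a custom config (parameter name made up) will silently fail to override the pipeline defaults:

      ```nextflow
      // custom.config -- does NOT override nf-core pipeline defaults
      params {
          input = 'samples.csv'
      }
      ```

      The same setting in a yaml/json file passed via -params-file will take effect.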
    1. Several new tools have recently been developed to leverage long-reads for taxonomic profiling

      Long-read approaches to taxonomic profiling:

      • k-mer based: Kraken 2, Sourmash
      • read mapping to an index: Centrifuger, MetaMaps, ..
      • marker genes: Melon, PhyloSift, ..

    2. Our results indicate that Lemur can efficiently process large datasets within minutes to hours in limited computational resource settings.
    3. Lemur and Magnet have limitations that vary by use case. Reliance on bacterial marker genes necessarily implies it cannot generalize to viral genome classification
    4. reliance on the marker genes makes it less sensitive than alternatives like Kraken 2 or MetaMaps, which use all long reads and complete genomes.
    5. The EM algorithm begins by initializing F(t) to the uniform distribution and initializing P(r|t) for each read and taxon pair (r, t).
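      Assuming this follows the usual mixture-model formulation for read-level abundance (as in Emu-style tools; the excerpt doesn’t spell it out), the iteration alternates between:

      ```latex
      % E-step: posterior probability that read r originates from taxon t,
      % given the current abundance estimate F(t)
      P(t \mid r) = \frac{F(t)\, P(r \mid t)}{\sum_{t'} F(t')\, P(r \mid t')}

      % M-step: re-estimate each taxon's abundance over all N reads
      F(t) = \frac{1}{N} \sum_{r} P(t \mid r)
      ```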
    6. The goal of Magnet is to detect and remove potential false positives by performing competitive read alignment leveraging all of the reads mapped against the entire reference genome
    7. Lightweight tools for taxonomic profiling: presence/absence + abundance estimation

      • Lemur: marker-gene based; uses EM (similar to Emu)

        • Takes raw reads and creates an abundance estimate
      • Magnet: whole genome; maps reads to reference genomes

        • Takes the abundance estimate + raw reads and removes false-positive calls by thresholding alignments (ANI, mapping quality) to representative genomes from clustering
    1. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment
    1. improves our previous method, MHG-Finder, by utilizing a guide tree to significantly improve scalability and provide more informative biological results
    2. A maximal homologous group, or MHG, is defined as a maximal set of maximum-length sequences whose evolutionary history is a single tree
    1. processes such as horizontal gene transfer or gene duplication and loss may disrupt this homology by recombining only parts of genes, causing gene fission or fusion
    1. Structural variants (SVs), genomic alterations of 10 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations
    2. Recent work utilizes coassembly graphs for metagenomes to decompose strain diversity into haplotypes (30), but to the best of our knowledge, this is the first time coassembly graph patterns have been used for automated detection of SVs in a metagenome series.
    3. In isolate genomics, the goal of SV detection is relatively straightforward: detect long genomic differences between a sequence and reference genome that can be classified as an insertion, deletion, inversion, duplication, translocation, or any combination
    1. You can install modules from nf-core/modules in your pipeline using nf-core modules install. A module installed this way will be installed to the ./modules/nf-core/modules directory.
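      After running e.g. `nf-core modules install fastqc` (illustrative module name), the module is pulled into the workflow with an include statement; the path below follows the install location described in the quote, though the exact layout may differ between nf-core tools versions:

      ```nextflow
      include { FASTQC } from './modules/nf-core/modules/fastqc/main'
      ```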
    1. The use of Conda recipes specified using the conda directive needs to be enabled explicitly in the pipeline configuration file (i.e. nextflow.config):
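      The setting in question is:

      ```nextflow
      // nextflow.config
      conda.enabled = true
      ```

      after which a process can declare its recipe, e.g. `conda 'bioconda::samtools=1.17'` (illustrative package pin).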
    1. Any channel in the workflow can be assigned to an output, including process and subworkflow outputs. This approach is intended to replace the publishDir directive.

      I guess this is to publish important files and exclude intermediate ones?
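      Likely yes. A sketch of the workflow output syntax (this is a preview feature whose syntax has shifted across recent Nextflow releases, so treat structure and names as illustrative):

      ```nextflow
      workflow {
          main:
          ch_qc = FASTQC(reads_ch)

          publish:
          ch_qc >> 'qc'
      }

      output {
          'qc' {
              path 'results/qc'
          }
      }
      ```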

    1. We prefer to be explicit to aid code clarity, as such the $it syntax is discouraged and will slowly be phased out of the Nextflow language.
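      For example:

      ```nextflow
      // Implicit closure parameter (discouraged)
      samples_ch.map { it.toUpperCase() }

      // Explicit named parameter (preferred for clarity)
      samples_ch.map { sample -> sample.toUpperCase() }
      ```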
    1. a process will emit value channels if it is invoked with all value channels, including simple values which are implicitly wrapped in a value channel.
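      A minimal illustration (process name is made up): invoked with a simple value, the process emits a value channel, which can then be consumed any number of times:

      ```nextflow
      process SAY_HELLO {
          input:
          val name

          output:
          stdout

          script:
          "echo Hello, ${name}"
      }

      workflow {
          greeting = SAY_HELLO('world')  // simple value in, value channel out
          greeting.view()
          greeting.view()  // a queue channel could not be consumed twice like this
      }
      ```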