In the strict syntax, variables must be declared with def and must not specify a type:
- Jul 2025
-
-
-
www.nextflow.io www.nextflow.io
-
each row is simply a list of columns.
-
-
training.nextflow.io training.nextflow.io
-
splitCsv() reads each line into an array, and each comma-separated value in the line becomes an element in the array
-
-
training.nextflow.io training.nextflow.io
-
output: path "UPPER-${input_file}" script: """ cat '$input_file' | tr '[a-z]' '[A-Z]' > 'UPPER-${input_file}'
How can I minimize the repetition in the output path name in this Nextflow process?
-
You know how to collect outputs from a batch of process calls and feed them into a joint analysis or summation step.
-
things could get a little tricky, because we need to be able to handle an arbitrary number of input files. Specifically, we can't write the command up front, so we need to tell Nextflow how to compose it at runtime based on what inputs flow into the process.
-
we're not adding the operator in the context of a channel factory, but to an output channel.
-
-
www.nextflow.io www.nextflow.io
-
A named workflow is a workflow that can be called by other workflows:
-
As a best practice, params should be used only in the entry workflow and passed to workflows and processes as explicit inputs.
-
-
www.biorxiv.org www.biorxiv.org
-
MHG is formed by identifying and grouping all homologous sequences
-
evolutionary events
-
are encapsulated within the same MHG
-
-
training.nextflow.io training.nextflow.io
-
write another pipeline that calls on one of those processes, you just need to type one short import statement to use the relevant module. This is better than just copy-pasting the code, because if later you decide to improve the module, all your pipelines will inherit the improvements.
-
-
-
They encapsulate applications and dependencies in portable, self-contained packages that can be easily distributed. Containers are also key to enabling predictable and reproducible results.
-
Nextflow was one of the first workflow technologies to fully embrace containers for data analysis pipelines.
as opposed to using conda as much as possible before containerization?
-
Today, workflows may comprise dozens of distinct container images. Pipeline developers must manage and maintain these containers and ensure that their functionality precisely aligns with the requirements of every pipeline task.
-
Wave — a container provisioning and augmentation service that is fully integrated with the Nextflow and Nextflow Tower ecosystems.
-
Wave allows developers to manage containers as part of the pipeline itself
-
-
www.biomesense.com www.biomesense.com
-
Our platform combines novel hardware with AI-enabled bioinformatics to unlock the personalized medicine potential of the gut microbiome
I wonder what bioinformatics they are doing that could be useful for omi
-
-
academic.oup.com academic.oup.com
-
Is this relevant to metabolic reconstruction part of SOMATEM pathways?
-
-
www.biorxiv.org www.biorxiv.org
-
Todd says this is no comparison to Sylph. Read later to figure out if interested (this one doesn't cite Sylph, maybe peers?)
-
-
www.nextflow.io www.nextflow.io
-
Specifying the Conda environments in a separate configuration profile is therefore recommended to allow the execution via a command line option and to enhance the workflow portability
-
process.conda = 'samtools'
does this mean all tools / processes using conda need to be pre-specified in the conda.profile? seems dumb..
-
You can also download Conda lock files from Wave build pages. These files list every package and its dependencies, so Conda doesn’t need to resolve the environment. This makes environment setup faster and more reproducible.
-
conda '/some/path/my-env.yaml'
What is this path relative to?
-
-
www.nextflow.io www.nextflow.io
-
launch directory
Is this the working directory from which nextflow is called?
-
profiles are applied in the order in which they are specified on the command line.
-
The standard profile is used by default when no profile is specified.
-
Config scopes are used to group related config options
-
-
seqera.io seqera.io
-
The only difference when compared with legacy syntax is that the process is not bound with specific input and output channels, as was previously required using the from and into keywords respectively
-
Another exciting feature of Nextflow DSL 2 is the ability to compose built-in operators, pipeline processes and sub-workflows with the pipe (|) operator
-
mimics the expressiveness of the Unix pipe model
-
-
www.nature.com www.nature.com
-
(Table 3) Might be relevant to decontamination approaches suggested in the SOMATEM pipeline?
-
due to their inherent complexity and the limited availability of decontamination pipelines compared with those for marker gene datasets
-
-
nf-co.re nf-co.re
-
multiple profiles can be loaded
-
later profiles can overwrite earlier profiles.
-
Please only use Conda as a last resort i.e. when it’s not possible to run the pipeline with Docker, Singularity, Podman, Shifter, Charliecloud, or Apptainer.
Why?: seqera AI chat says nf-core has reproducibility issues with conda: run-time resolution causing hash verification failures. Here's the summary -
When Conda is Still Appropriate
Conda remains useful for:
- Development and prototyping: When you need flexibility to update packages
- Custom/proprietary software: When containers aren't available
- Resource-constrained environments: Where container overhead is problematic
- Legacy systems: Where container runtimes aren't available
Best Practice Recommendation
For production workflows, the recommended approach is:
- Primary: Use Docker/Singularity containers
- Development: Use Wave to generate containers from conda specs
- Fallback: Use conda only when containers aren't feasible
- Future: Leverage conda lock files for maximum reproducibility
The "last resort" recommendation reflects the hard-learned lessons from managing nearly 1,500 nf-core modules and the practical challenges of maintaining reproducible bioinformatics workflows at scale.
-
-
training.nextflow.io training.nextflow.io
-
Process directives allow the specification of settings for the task execution such as cpus, memory, container, and other resources in the workflow script.
-
it’s strongly suggested to define the process settings in the workflow configuration file instead of the workflow script
-
-
www.nextflow.io www.nextflow.io
-
quick alternative to building Conda packages in the local computer
-
Wave allows the provisioning of containers based on the conda directive used by the processes in your pipeline
-
Seqera Platform access token is not mandatory, but it is recommended in order to access private container repositories and pull public containers without being affected by service rate limits
Usage limits here
-
-
www.biorxiv.org www.biorxiv.org
-
FastGA finds alignments between two genome sequences more than an order of magnitude faster
-
stores millions of alignments in a fraction of the space of a conventional CIGAR-string
-
using a trace-point encoding
-
We carefully separate the problems of genome alignment and genome homology
-
by not conflating these two steps, FastGA can be used for other downstream tasks such as finding recurrent insertions due to transposable elements
-
The key idea is to reduce the number of k-mers inspected for seed matches by using only those that are minimizers in a window of some small size
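A minimal sketch of window minimizers, to make the idea in this highlight concrete. It is not FastGA's implementation: the function name, the toy sequence, and the use of plain lexicographic order (real tools typically order k-mers by a hash) are my assumptions.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs that are the smallest k-mer
    in at least one window of w consecutive k-mers."""
    kmer_list = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmer_list) - w + 1):
        window = range(start, start + w)
        # Assumption: lexicographic order stands in for a hash ordering;
        # ties are broken by the leftmost position.
        best = min(window, key=lambda i: (kmer_list[i], i))
        selected.add((best, kmer_list[best]))
    return sorted(selected)


if __name__ == "__main__":
    toy = "ACGTACGT"
    all_kmers = len(toy) - 3 + 1
    chosen = minimizers(toy, k=3, w=3)
    print(f"{len(chosen)} minimizers kept out of {all_kmers} k-mers: {chosen}")
```

Only the selected k-mers would need to be looked up as seeds, which is where the reduction in inspected k-mers comes from.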
-
Lots of claims about speed and memory-efficient storage of alignments. Read more later, ignoring the details that are too technical.
-
-
www.nextflow.io www.nextflow.io
-
it is a way to perform a dry-run
-
provide a dummy script that mimics the execution
-
You can define a command stub, which replaces the actual process command when the -stub-run or -stub command-line option is enabled:
-
-
www.nature.com www.nature.com
-
In this Review, we examine the main machine learning concepts, tasks and applications that are relevant for experimental and clinical microbiologists
-
provide the minimal toolbox
-
understand, interpret and use machine learning
-
-
www.biorxiv.org www.biorxiv.org
-
investigated the concordance between the species and genus level calls across the tools
-
we built a new marker gene database with 43 markers for bacteria+archaea and 48 markers for fungi
-
We then built the database using recent versions of NCBI RefSeq: version 221 for both bacteria (329,194 assemblies) and archaea (1,911) and version 222 for fungi (564)
https://zenodo.org/records/10802546
Need to create a reproducible process/script to update the database with newer versions of NCBI RefSeq!
-
The final database was 4.1 GB, containing 3,335,783 sequences.
-
Lemur, a marker-gene-based long-read taxonomic profiler
-
Magnet, a genome-based validation tool for confirming the presence and absence of microbial genomes present in a sample
-
we introduce Lemur and Magnet, a pair of tools optimized for lightweight and accurate taxonomic profiling for long-read shotgun metagenomic datasets
What makes this long-read compatible? The EM (expectation maximization) technique similar to Emu?
-
-
academic.oup.com academic.oup.com
-
BugBuster is a fully containerized, modular, and reproducible workflow implemented in Nextflow. The pipeline streamlines analysis at level of reads, contigs, and metagenome-assembled genomes (MAGs), offering dedicated modules for taxonomic profiling and resistome characterization.
-
Thanks to the use of containers, BugBuster can be deployed with minimal configuration on workstations, high-performance clusters, or cloud platforms
Does this really require containers, or can it be done with conda as well?
-
-
academic.oup.com academic.oup.com
-
an update to GTDB-Tk that uses a divide-and-conquer approach where user genomes are initially placed into a bacterial reference tree with family-level representatives followed by placement into an appropriate class-level subtree comprising species representatives
-
-
academic.oup.com academic.oup.com
-
Genome Taxonomy Database (GTDB; https://gtdb.ecogenomic.org) provides a phylogenetically consistent and rank normalized genome-based taxonomy for prokaryotic genomes sourced from the NCBI Assembly database.
-
GTDB uses relative evolutionary divergence (RED) to delineate higher-rank taxa and average nucleotide identity (ANI) to delineate species clusters
-
-
www.nextflow.io www.nextflow.io
-
Groovy-style type annotations should be used instead:
-
-
nextflow.io nextflow.io
-
The take: section is used to declare the inputs of a named workflow:
-
-
nextflow.io nextflow.io
-
Statements and script declarations can not be mixed at the same level.
-
-
-
Abstract
-
unlike the linear number of k-mers in a sequence, the number of subsequences grows exponentially
What is k-mer vs subsequence difference?
(DuckDuckGo, AI-generated) A k-mer is a contiguous substring of fixed length k taken from a biological sequence, while a subsequence is any sequence derived from another by deleting some elements without changing the order of the remaining elements, so it need not be contiguous. In bioinformatics, k-mers are often used for tasks like DNA sequence assembly and analysis.
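A small sketch to check the difference for myself. The helper names and the toy sequence are made up for illustration; the point is only that k-mers are contiguous while subsequences may skip positions.

```python
from itertools import combinations


def kmers(seq, k):
    """Every contiguous substring of length k; there are len(seq) - k + 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def subsequences(seq, k):
    """Every distinct length-k subsequence: order kept, gaps allowed."""
    return {"".join(chars) for chars in combinations(seq, k)}


if __name__ == "__main__":
    seq = "ACGTAC"
    n = len(seq)
    print("k-mers (k=3):", kmers(seq, 3))
    print("distinct subsequences (k=3):", len(subsequences(seq, 3)))
    # Growth rates: k-mers per fixed k grow linearly with n, while the
    # number of non-empty position subsets (all subsequences) is 2**n - 1.
    print("k-mers for k=3:", n - 3 + 1)
    print("non-empty subsequences by position:", 2 ** n - 1)
```

Every k-mer is also a subsequence, but not the other way around: "AGA" is a subsequence of "ACGTAC" and not a k-mer of it.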
-
-
training.nextflow.io training.nextflow.io
-
best practices guidelines enforced by the project further ensure that the pipelines are robust, well-documented, and validated against real-world datasets.
something to learn from and emulate?
-
Convert basic Nextflow modules to nf-core compatible modules
-
-
bmcmicrobiol.biomedcentral.com bmcmicrobiol.biomedcentral.com
-
potential limitation of shotgun sequencing is the complexity of bioinformatics pipelines required for its analysis
This is a great statement for making the Somatem pipeline more accessible
-
scripts were made available for researchers
-
KBase platform [37] offers a user-friendly interface that allows for the analysis of data using most of the tools described in this publication
other GUI tools to benchmark to / comment on?
-
Unexpectedly, k-mer approaches resulted in rather high false positive rates, which may lead to misinterpretations of microbial community composition.
Could this be improved with tool choices and databases? - There are more recent tools than Kraken2-Bracken and sourmash for this. - Centrifuger is a recent tool, and sylph is known to be more stringent, with fewer false positives.
-
-
training.nextflow.io training.nextflow.io
-
course demonstrates how to implement a simple variant calling pipeline with GATK (Genome Analysis Toolkit)
-
linear workflow
-
accessory files
-
-
training.nextflow.io training.nextflow.io
-
understand what it does and how it should be configured before attempting to run it
-
ln -s $NXF_HOME/assets pipelines
If NXF_HOME is not found, try ~/.nextflow
-
nf-core project enforces strong guidelines for how pipelines are structured, and how the code is organized, configured and documented.
-
subworkflows
-
reuse chunks of code across different pipelines
-
flexible while minimizing maintenance burden
-
'utility' or housekeeping subworkflows
-
accessory functions
-
-
nf-co.re nf-co.re
-
Nextflow works best with an active internet connection, as it is able to fetch all pipeline requirements.
-
-
training.nextflow.io training.nextflow.io
-
It may seem like a lot of work to accomplish the same result as the original pipeline, but you do get all those lovely reports generated automatically
-
features of nf-core, including input validation and some neat metadata handling capabilities
-
-
-
unstructured data does not have a predefined data model, it is not easily processed and analyzed through conventional data tools and methods. It is best managed in nonrelational or NoSQL databases or in data lakes, which are designed to handle massive amounts of raw data in any format.
-
-
nf-co.re nf-co.re
-
NCBI taxonomy dump
How to retrieve this?
Download a taxdump.tar.gz file from NCBI servers and extract the names.dmp and nodes.dmp files from it. taxonomizer: sherrilmix.github.io
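A minimal sketch of the extraction step, assuming taxdump.tar.gz has already been downloaded from the NCBI servers to a local path. The function name and output directory are hypothetical; only names.dmp and nodes.dmp are pulled out, as described above.

```python
import shutil
import tarfile
from pathlib import Path

# The two tables named in the note above.
WANTED = ("names.dmp", "nodes.dmp")


def extract_taxonomy_tables(taxdump_path, out_dir):
    """Copy names.dmp and nodes.dmp out of a local taxdump.tar.gz.

    Returns the sorted list of paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(taxdump_path, "r:gz") as archive:
        for member in archive.getmembers():
            base_name = Path(member.name).name
            if member.isfile() and base_name in WANTED:
                target = out_dir / base_name
                # Stream the member to disk instead of loading it in memory.
                with archive.extractfile(member) as source, open(target, "wb") as sink:
                    shutil.copyfileobj(source, sink)
                written.append(target)
    return sorted(written)


# Example call (paths are placeholders):
# extract_taxonomy_tables("taxdump.tar.gz", "taxonomy/")
```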
-
-
www.nextflow.io www.nextflow.io
-
highlights source code in red for errors and yellow for warnings
-
Problems tab. Here, you can search for diagnostics
-
language server parses scripts and config files according to the Nextflow language specification, which is more strict than the Nextflow CLI
-
Include declarations in scripts and config files act as links, and ctrl-clicking them opens the corresponding script or config file.
-
view the definition of a symbol (e.g., a workflow, process, function, or variable),
-
can format your scripts and config files based on a standard set of formatting rules
-
right-click the symbol, select Rename Symbol
-
Format Document command in the command palette
-
-
-
There are three most popular pipelines used for NGS analyses: QIIME, mothur and MetAMOS
Don't know if this statement is justified given that the number of citations differ by 2 orders of magnitude - Mothur: 20 K - Qiime : 37 K - MetAMOS: 230
-
-
www.pnas.org www.pnas.org
-
Using first-passage analysis validated by Monte Carlo simulations, we quantitatively characterize nucleotide-specific error rates during RNA polymerase II transcription
(comments before reading in full:) Curious how you got all the rates mentioned in Fig 1C. - Appendix table S1 shows most rate constant parameters are fitted; I wonder how they were fitted - The rate constants that were fixed, did you get those from literature..?
-
-
www.ncbi.nlm.nih.gov www.ncbi.nlm.nih.gov
-
New taxa are added to the Taxonomy database as data are deposited for them.
How often should I update this in a classifier like centrifuger?
-
-
www.biorxiv.org www.biorxiv.org
-
Visualization were created with pgfplots
why not ggplot? Since you were using R already..
-
knowledge gap between microbes that are only studied en masse as communities and those select few species whose molecular, genetic, or physiological diversity is studied in detail.
Since you are not capturing species that are absent from the title/abstract, you would also not capture a future study that employs automated robotics to study multiple organisms, like the one you mentioned in the previous paragraph!
-
apply these tools to the myriad species that live in the understudied corners of our world.
Roboticizing microbiology involves moving parts (shaking cultures), changing temperatures etc. and it will be harder to automate the study of understudied species with finicky behaviours. For example, certain streptomyces species (roseosporus) form aggregates if not grown with the proper shaking in a bevelled flask within viscous media with glass beads put in.
How on earth do you automate your way out when you cannot standardize culture conditions for finicky organisms?
-
Statisticians have taught for decades that the most efficient and robust experimental designs vary multiple factors simultaneously and then deconvolve the effects and interactions with simple statistical models
This will be a nightmare in biology with low sample sizes and limited data. You will need a lot more depth of data to de-convolve factors efficiently even with the newer AI methods
-
counted the number of PubMed articles that refer to each species in their title or abstract
What about microbes mentioned in the body of the paper or even tables of supplementary material etc. Are these not significant enough to count as "understanding" these microbes?
New AI based methods would make it possible to scrape such references given contextual keywords etc. that discriminate between casual references vs emphasis enough that the microbe is being "studied".
Also, what does it mean for a microbe to be "understood" anyways? Do these all qualify, and at the same magnitude? 1. Microbiology (culture methods, media, growth rate calculations) 2. Synthetic bio (figuring out regulatory elements.. promoters, RBS and such that enable expressing genes on plasmids or chromosomal integration) 3. Bioinformatic explorations involving function (insights from meta-transcriptomic studies)
-
-
www.nytimes.com www.nytimes.com
-
few sips can help lower your overall body temperature, mimicking its natural decline before you sleep
How does this compare with drinking hot milk, which is advised by some sources?
-
-
www.nature.com www.nature.com
-
Notes from Todd:
Huge caveat that the study assumes plasmids are all detected via variable sequencing depths and impossible to speak to copy numbers across varying sequencing technologies
Interesting at the exploratory level but just scratching the surface and may be biased due to the biased nature of the samples in the SRA
-
PCN was then calculated for each sample as the ratio between the mean coverage of plasmid contigs and the mean coverage of the chromosome
How robust is this measurement compared to qPCR (or even better: ddPCR) - Could be interesting to compare and benchmark this to some known data such as this paper: Accurate Determination of Plasmid Copy Number of Flow-Sorted Cells using Droplet Digital PCR
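The coverage-ratio definition in the highlight, written out as a sketch. The function name and the example numbers are made up, and taking an unweighted mean over plasmid contigs is my assumption; the paper may weight contigs by length.

```python
from statistics import mean


def plasmid_copy_number(plasmid_contig_coverages, chromosome_mean_coverage):
    """PCN = mean coverage of plasmid contigs / mean coverage of the chromosome."""
    if chromosome_mean_coverage <= 0:
        raise ValueError("chromosome coverage must be positive")
    # Assumption: every plasmid contig counts equally (no length weighting).
    return mean(plasmid_contig_coverages) / chromosome_mean_coverage


if __name__ == "__main__":
    # Hypothetical sample: two plasmid contigs, chromosome at 10x coverage.
    print(plasmid_copy_number([60.0, 40.0], 10.0))
```

Because both terms come from the same sequencing run, any coverage bias that differs between plasmid and chromosome feeds straight into the ratio, which is part of why a qPCR/ddPCR benchmark would be informative.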
-
-
www.biorxiv.org www.biorxiv.org
-
Besides the review itself, it's a nice organization of longitudinal data, so can be useful when looking for datasets (Nick Sapoval)
-
-
genome.cshlp.org genome.cshlp.org
-
A k-mer based taxonomic classification tool ; Much smaller database size than Kraken. Uses compressed k-mer indexing using BWT compression and FM index
-
(compression) Uses only unique portions of new genomes to reduce redundancy in the index. - Fig 1
-
FM-index provides a means to exploit both large and small k-mer matches by enabling rapid search of k-mers of any length
-
Centrifuge can assign a sequence to multiple taxonomic categories
-
Centrifuger is a more compressed version? what are the trade offs of this?
In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression
-
-
-
www.science.org www.science.org
-
This is very neat. This could be a good complex dataset with known ground truth to benchmark tools for strain resolution / time-course tracking / HGT tracking methods like rhea
(I would re-create the bulk sequencing by truncating the droplet-specific barcode and collecting all the sequences together.) I wish they had sequenced the full sample in parallel without the droplets for this purpose, though.
-
enables us to follow the relative abundances of these strains over time in the human donor
Isn't there a cheaper way to track strains without needing single-cell sequenced genomes?
-
-
journals.asm.org journals.asm.org
-
For each microbiome sample, its MNS was derived by searching its sequence against those of all samples produced by past studies
What does a "sample" mean: - A metagenome - a collection of sequences in a microbiome - a single sequence from a microbiome collection?
-
-
arxiv.org arxiv.org
-
evaluate UMA models on a diverse set of applications
which ones?
-
-
academic.oup.com academic.oup.com
-
Pairs of short reads with small edit distances, along with their unique molecular identifier tags, have been exploited to correct sequencing errors in both reads and tags.
nice summary of UMI working principle
-
-
zihad.com.bd zihad.com.bd
-
Sync Google Drive in linux using podman and rclone
What's the advantage of podman here?
-
-
zihad.com.bd zihad.com.bd
-
let’s automate the command using systemd
How does this compare to cronjob - don't need a system file for that
-
-
www.baeldung.com www.baeldung.com
-
First, we create a symbolic link to the unit file in the /etc/systemd/system directory.
A symlink has an issue: the service can fail to load on startup. Copying the file is better (Windsurf AI)
This is a common issue with systemd services that are symlinked from a user's home directory. The problem occurs because the home directory isn't mounted when systemd tries to read the service file during early boot. Here's how to fix it: 1. Copy the service file instead of symlinking it:
-
-
the-ken.com the-ken.com
-
It’s easier than teaching kids, and it’s more exciting in some ways,” said an AI trainer
Is it more satisfying than teaching humans though?
-
-
academic.oup.com academic.oup.com
-
Minimap2 is a new paradigm in mapping and by extension pairwise alignment. Uses concepts from full-genome aligners (seed-chain-align) and works for short, long reads (noisy) and RNA-seq as well. - Uses: read mapper, long-read overlapper, full-genome aligner
capability of minimap2 comes from a fast base-level alignment algorithm and an accurate chaining algorithm..
Minimap2 indexes reference k-mers with a hash table
-
-
clauswilke.com clauswilke.com
-
More recently, Lior Pachter has argued that t-SNE and the related UMAP do not serve a meaningful purpose in data analysis and are only useful for producing art.
-
-
www.nature.com www.nature.com
-
feature extraction attempts to reduce the dimensionality of a dataset by building a compressed representation of the input features
Example: Go from species to higher taxa ~ genera, family..
-
Methods like t-stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) faithfully capture and reveal local and non-linear relationships in complex microbiome datasets, but their tuning is finicky
-
Whether taxonomic or functional profiles provide a better discriminatory power in downstream analysis is subject to debate [23,24,25].
Do you mean like a classifier?
-
-
www.nature.com www.nature.com
-
Interesting note to read on the pseudocount value and how it should be set to be consistent across tools
-
-
-
Different channels can have the same package, so conda must handle these channel collisions.
Biopython has this issue. It sometimes causes errors of package resolution since it is present in both bioconda (older versions) and conda-forge (more recent, maintained)
Error: libmamba Could not solve for environment specs
To solve this, put conda-forge at higher priority than bioconda.
-
-
www.nature.com www.nature.com
-
many environments can still not be classified well at the species level11 because of database incompleteness. Our strategy for tackling this problem was to design sylph so that researchers can create customized databases from their novel genomes or MAGs, although this requires the generation of new genomes for researchers working in undercharacterized microbiomes.
How can users create their own customized databases? - Find any reference to this in the methods/suppl
-
-
www.nytimes.com www.nytimes.com
-
make it too easy to build manufacturing sites that could cause more pollution
Can you force less pollution in other ways?
-
avoid rigorous environmental review
Does damage to the environment have to be in the same bucket as not wanting any change in their backyards (NIMBY)?
-
-
www.biorxiv.org www.biorxiv.org
-
Advocating for shallow metagenomics for better taxonomic resolution (sub-species) compared to 16S ; this is important for low microbial density samples (skin microbiomes).
-
16S amplicon sequencing exhibited extreme bias toward the most abundant taxon
I assume the PCR step causes most of the issue, and qPCR with species-specific primers doesn't reproduce this since there is no competition?
-
- Jun 2025
-
www.ashbyhq.com www.ashbyhq.com
-
let AI surface which applicants match your given criteria.
The video here gives a nice glance for applicants to understand what ATS does
-
-
Local file Local file
-
especially interested in candidates with prior wet lab experience and a generalist quantitative mindset.
How do you show generalist quantitative mindset in resume? - maybe easier in cover letter?
-
-
academic.oup.com academic.oup.com
-
Minimap2 follows a typical seed-chain-align procedure as is used by most full-genome aligners
- anchor = exact matches of minimizers from the query (seeds) in the reference (from database)
- chain = sets of colinear anchors
- align = extending the chain + filling in the gaps
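A toy sketch of the "chain" step, to make "sets of colinear anchors" concrete. It is not minimap2's algorithm (which scores chains with gap costs and a bounded look-back); this just finds the longest set of anchors whose reference and query positions both increase. The function name and anchor coordinates are made up.

```python
def longest_colinear_chain(anchors):
    """anchors: list of (ref_pos, query_pos) seed matches.

    Returns the longest chain in which both coordinates strictly increase,
    using a simple O(n^2) dynamic program.
    """
    anchors = sorted(anchors)
    if not anchors:
        return []
    n = len(anchors)
    best_len = [1] * n   # length of the best chain ending at anchor i
    previous = [-1] * n  # back-pointer to the preceding anchor in that chain
    for i in range(n):
        for j in range(i):
            colinear = anchors[j][0] < anchors[i][0] and anchors[j][1] < anchors[i][1]
            if colinear and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                previous[i] = j
    end = max(range(n), key=best_len.__getitem__)
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = previous[end]
    return chain[::-1]


if __name__ == "__main__":
    # (30, 2) breaks colinearity with its neighbours and is left out.
    print(longest_colinear_chain([(10, 5), (20, 15), (30, 2), (40, 25)]))
```

The "align" step would then extend the ends of the chain and fill the gaps between adjacent anchors with base-level alignment.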
-
-
ccd.rice.edu ccd.rice.edu
-
Myers-Briggs Type Indicator
May benefit from this during interviews. Getting to know yourself better. strengths, weaknesses
Email Raylea for the code. the other 2 : strong and focus2 is more for undergrads
-
Myers-Briggs Type Indicator® (MBTI®) assessment
Another tool mentioned in the CCD appointment types is strong interest inventory
-
-
academic.oup.com academic.oup.com
-
We extracted 202 question-answer pairs from the KB and 39 questions generated by GPT-4 for training and testing purposes
Isn't it weird to train and test on the same questions?
-
-
www.science.org www.science.org
-
Read: kate adamala harm of mirror life
-
David A. Relman,
Todd says this is a vocal guy with influence in this camp
-
capability to create mirror life is likely at least a decade away and would require large investments and major technical advances
I believe, making these self-replicating is a long way off. Will need all the machinery including polymerases, ribosomes etc. as well as a way to make the necessary monomers from available forms in the environment
-
-
www.nature.com www.nature.com
-
Nice guideline document for thinking about contamination in low biomass samples and wet-lab + computational approaches of dealing with it
-
-
www.science.org www.science.org
-
This method merges an embedding based protein homolog search with a genomic context similarity. This needs a multi-modal LM including aa and DNA seqs. - Genomic context examples: CRISPR/defense islands.
Modalities of protein homolog (sequence similarity) search:
1. Amino acid sequence based: BLAST, HMMER
2. Embedding based search: using ESM2 embeddings
3. Structural search: using AlphaFold structures
But all of these lack the extra boost provided by adding in genomic context, which currently is only done manually!
-
-
rice.app.box.com rice.app.box.com
-
$50 copay/visit
Urgent care.
No coverage for non-urgent use.
- Does CVS Minute clinic count as urgent care?
-
-
stackoverflow.com stackoverflow.com
-
If you make a 1 byte change and push the file again, you'll use another 500 MB of storage and no bandwidth
this seems insane; what if I don't want to track the versions of this large file and only keep the final versions? - There should be some option to just change the link to the latest version and dump the old version without using the bandwidth during download as well - See the latest version of git-lfs for info on this
-
-
genomebiology.biomedcentral.com genomebiology.biomedcentral.com
-
A higher resolution but still quick method to compare multiple genomes (unassembled also). Uses a full k-mer spectra instead of minHash methods
Unlike MinHash-based methods that produce distances and have lower resolution, KPop is able to accurately map sequences onto a low-dimensional space.
Questions (before reading the paper):
- By unassembled genomes, do you mean contigs?
- How do these k-mer spectra make it higher resolution than MinHash?
- Does dataset dependence of these transformations make this a hurdle in some way?
-
-
KPop, a novel versatile method based on full k-mer spectra and dataset-specific transformations, through which thousands of assembled or unassembled microbial genomes can be quickly compared
Does dataset dependence of these transformation make this a hurdle in some way?
-
simplified signatures (“sketches”) based on some dataset-independent choices
Does dataset independence make these tools better than current one in some way?
-
the most relevant methods to classify or compare microbial genomes based on k-mers can be broadly divided into the following categories:
Good section to skim
-
KPop is able to accurately map sequences onto a low-dimensional space
The claim is that this is higher resolution than MinHash methods like mash?
-
-
www.biorxiv.org www.biorxiv.org
-
Cool study that re-queries a wide range of metagenomic data to raise some new thoughts on phage host range questions
Read later to clarify thoughts in the hypothesis comments
-
we observed surprising cases of viruses targeted by microbes not expected to be viable hosts.
Interesting, need to read to find out why you won't expect something to be viable host. Is it mismatched environmental source of the phage vs the host / phylogenetic mismatch between expected host of the phage and spacer source?
-
CRISPR spacers frequently matched multiple MGEs
Do you mean the same spacer matches multiple MGEs? - Need to clarify better..
-
-
pmc.ncbi.nlm.nih.gov pmc.ncbi.nlm.nih.gov
-
Previous efforts to design enzymes have largely focused on finding geometric matches between model active sites and preexisting protein structures, an approach akin to buying a suit from a thrift store; it is unlikely the fit will be perfect.
Great analogy!🤣
-
-
www.geeksforgeeks.org www.geeksforgeeks.org
-
Unlike hard links, which point directly to the file data on the disk, symlinks are independent files that contain a path to another file or directory
Hard link vs soft link
I'm curious how a hard link would operate when synced to another computer via git/cloud drives. In my experience, a hard link I made in Windows broke when I used rclone sync with OneDrive onto a Linux PC.
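A quick local experiment to see the difference described in the highlight, assuming a POSIX filesystem (on Windows, creating symlinks may need extra privileges). File names and the function name are made up. It says nothing about how sync tools treat links, only how the two link types behave on one disk.

```python
import os
import tempfile


def link_demo(workdir):
    """Create a hard link and a symlink to one file, then delete the original."""
    original = os.path.join(workdir, "data.txt")
    hard = os.path.join(workdir, "hard.txt")
    soft = os.path.join(workdir, "soft.txt")
    with open(original, "w") as handle:
        handle.write("hello\n")

    os.link(original, hard)     # second directory entry for the same file data
    os.symlink(original, soft)  # separate small file that stores a path

    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    symlink_target = os.readlink(soft)

    os.remove(original)
    hard_survives = os.path.exists(hard)  # data still reachable via the hard link
    soft_survives = os.path.exists(soft)  # follows the link, which now dangles
    return same_inode, symlink_target, hard_survives, soft_survives


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(link_demo(tmp))
```

Since a hard link is just another name for the same inode, a tool that copies files to a different filesystem cannot carry the link over as such, which may be what broke in the rclone case.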
-
Git treats symbolic links as special files that store the path to the target file. When you add a symbolic link to a Git repository, Git records the link information rather than the contents of the target file. Here’s how Git handles symbolic links during various operations:
is this only for softlinks?
-
-
www.biorxiv.org www.biorxiv.org
-
Their carriage often corresponds with changes to the host transcriptome
is this specific to conjugative plasmids or any plasmid?
-
-
microbiomejournal.biomedcentral.com microbiomejournal.biomedcentral.com
-
Summary: Uses short-reads from metagenomes to give strain level composition. employs tree-based k-mer indexing. Briefly they do: s1. cluster similar strains + tree index for searching, s2. generate strain specific k-mers (collinear blocks within same cluster) > build a matrix.
employs a novel tree-based k-mers indexing structure to strike a balance between the strain identification accuracy and the computational complexity…
By searching strains inside the identified clusters, StrainScan achieves a higher resolution than cluster-level tools such as StrainGE and StrainEst
Note: Also contrast with a newer Strainify tool?
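A toy sketch of what "strain-specific k-mers" (step s2 in the summary) could mean, to fix the idea. This is not StrainScan's implementation: it ignores reverse complements, collinear blocks, and the tree index, and the sequences and strain names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def strain_specific_kmers(strains, k):
    """Map each strain name to the k-mers found in no other strain of the cluster."""
    per_strain = {name: kmers(seq, k) for name, seq in strains.items()}
    # Count in how many strains each k-mer occurs.
    occurrences = Counter(km for kms in per_strain.values() for km in kms)
    return {
        name: {km for km in kms if occurrences[km] == 1}
        for name, kms in per_strain.items()
    }


# Hypothetical toy cluster: two near-identical "strains" differing at one base.
cluster = {
    "strain_A": "ACGTACGTGA",
    "strain_B": "ACGTACCTGA",
}
specific = strain_specific_kmers(cluster, k=4)
for name, kms in specific.items():
    print(name, sorted(kms))
```

Only the k-mers overlapping the single differing base end up strain-specific here, which is why closely related strains need clustering first: most of their k-mers are shared.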
-
-
pubs.acs.org pubs.acs.org
-
Macromolecular binding pockets, on the other hand, are located on the protein surface and are often shallower
protein-protein interactions?
-
-
-
paste it in Thunderbird as a quote block (Ctrl+Shift+o).
Making a quote block in Thunderbird
-
-
elifesciences.org elifesciences.org
-
taking fully advantage of our algorithm might involve coordination between multiple colleagues in a lab who are constructing plasmids with different expected sequences.
This is something a local core like GCEC can help with
-
it could be further reduced by executing time-consuming dynamic programming only for some query-reference pairs that necessitate high levels of accuracy and by introducing parallel computing
Nice. Any other ideas to reduce RAM use?
-
theoretical minimum number of reads that is required for the reliable consensus calculation is 30 reads per plasmid
Does this depend on the plasmid length and on the preparation kit used before sequencing, which determines fragmentation?
-
-
nf-co.re nf-co.re
-
Please only use Conda as a last resort, i.e., when it’s not possible to run the pipeline with Docker or Singularity.
Why is conda not recommended?
-
-
-
To enable data augmentations and stitching of multiple contigs together, we introduce two special tokens. The ‘#’ token is used to join sequences from the same species with uncertain distance to each other, while the ‘@’ token is used for sequences that are from the same contig/strand and are near each other.
Do you ensure that the stitched contigs are in the same order within the chromosome - Is this better than using an assembly tool?
How are these delimiters, # and @, processed at the output stage?
If these delimiters are de-emphasized during the calculations, would this lead Evo 2 to learn a false sense of continuity between contigs that are not connected in the actual genome?
-
Evo 2 can also leverage its unique representation of biological complexity to generate new genomic sequences
What is the point in generating genomic sequences with some vague notion such as "naturalness"? - Assuming future adaptations would include prompting to generate specific sequence features; maybe it makes more sense in this context?
-
previously demonstrated that machine learning models trained on prokaryotic genomic sequences can model the function of DNA, RNA, and proteins
Elaborate "Can model the function"
-
-
www.mdpi.com www.mdpi.com
-
VOGDB, which is a database of virus orthologous groups. VOGDB is a multi-layer database that progressively groups viral genes into groups connected by increasingly remote similarity
Layers: 1. pair-wise sequence similarity 2. sequence profile alignment 3. predicted protein structures
The first layer is based on pair-wise sequence similarities, the second layer is based on the sequence profile alignments, and the third layer uses predicted protein structures to find the most remote similarity
-
-
www.nature.com www.nature.com
-
Specific gut species distinguish left-sided versus right-sided CRC (area under the curve = 0.66) with an enrichment of oral-typical microbes
It is very surprising that left-sided and right-sided colorectal cancers are heterogeneous!
-
-
resources.biginterview.com resources.biginterview.com
-
totally AI-generated resumes have a sameness to them that recruiters can tell right away. They all use similar language, and they’re almost identical.
-
-
www.cnn.com www.cnn.com
-
The Trump administration is preparing to cancel a large swath of federal funding for California
How do you prevent such partisan and vindictive actions by the federal government against states?
The same thing is happening in India. Is there any framework people have seen in a more federated country, maybe Germany?
-
-
www.biorxiv.org www.biorxiv.org
-
or methodological differences in screening algorithms.
This could have been elaborated a bit more. It is too generic and rather obvious
TODO: try bringing this up to Todd on Slack..
-
-
academic.oup.com academic.oup.com
-
Interesting study that expands the similarity metric used to mark core-genes, which determine clade membership and phylogeny by their homology. They sub-sample the similarity problem by predicting 3Di structural strings as opposed to full structure prediction.
- To identify core-genes, we traditionally use amino acid similarity (better than nucleotide similarity, given codon usage differences).
- Going one step further, we can use protein structures/folds to generalize this to deep clades where amino acid homology is quite low.
Read more to see how they implement this, and how robust the inferred homology is when it rests on an approximate, subsampling-like scheme: 3Di structural strings generated from amino acid sequences rather than full AlphaFold-style structure prediction.
-
can also be defined using structures
You mean protein structure
-
-
benjamindlee.com benjamindlee.com
-
, I increasingly use the Nim programming language for data processing tasks. Nim is under-appreciated in computational science but it is a very capable Python replacement for non-numerical data processing. At a high level, Nim is as easy to write as Python and as fast as C
nim = Interesting, cool, and fast Python-like programming language
How does this compare to Julia?
-
-
benjamindlee.com benjamindlee.com
-
It’s really helpful to be able to dictate my academic papers using my phone when inspiration hits me, wherever that may be.
Is audio transcription available with a plug-in?
-
-
journals.asm.org journals.asm.org
-
removed potential virulence genes and secretion systems (T3SS and T6SS) to ensure safety
Would keeping the secretion system enable easier protein purification through secretion tags?
-
-
arxiv.org arxiv.org
-
BioReason reasons over unseen biological entities and articulates decision-making through interpretable, step-by-step biological traces, offering a transformative approach for AI in biology that enables deeper mechanistic insights
Text in and text out:
- Answers questions like: What are the biological effects of this mutation, and what disease does it cause?
- (Summarized from the example in fig 1)
-
-
www.nature.com www.nature.com
-
Designs primers for clade specific (pan-specific) viral qPCR, single or tiled amplicon sequencing
-
Can include degeneracy..
-
How is single different from qPCR? only amplicon length maybe?
-
Is it really better than Olivar? Why?
-
Can this make the MSA as well? That’s too specialized - there are already CLI tools out there
-
-
-
citsci.org citsci.org
-
Extremophile Campaign: In Your Home (ECIYH)
Really cool citizen science campaign!
I'm curious who the scientists behind this are. Find out and document here or in a page note.
-
really cold places like near your air conditioner drip tray
Is the drip tray really that cold? Would be nice to get a temperature measurement as well!
-
-
www.biorxiv.org www.biorxiv.org
-
identified 90% more genus-level phage-host interactions than traditional assembly-based methods
how do you know their ground truth / if these are false calls?
-
-
abcnews.go.com abcnews.go.com
-
While some tech giants neared or imposed widespread layoffs last year, compensation for their CEOs climbed as much as tens of millions of dollars, according to an ABC News analysis of data released by research firm Equilar in May and June.
How much of this is due to automatic factors such as stock valuation gains?
-
-
www.nature.com www.nature.com
-
Metabolome analyses identify 15 mediating metabolites in pregnancy that improve ADHD prediction
Good that they add more support than diet self-reported through surveys.
-
- May 2025
-
www.iscb.org www.iscb.org
-
ISCB will grant remote presentation options for reasons associated to maternity/paternity leave, care for a family member, personal/medical disability, sickness, financial hardship, or potential visa problems.
potential visa problems: argue for USA re-entry issues
-
-
www.pnas.org www.pnas.org
-
Interesting paper about nucleotide sequence divergence using SCO genes (single-copy ortholog). They are thinking about a threshold for species identification in Eukaryotes like we do in prokaryotes
Neat part is they bring different taxa of wide ranging kingdoms to compare on the same plots! (don’t know if this is novel…)
In prokaryotes, homologous recombination, the basis of gene flow, depends directly on the degree of genomic sequence divergence, whereas in sexually reproducing eukaryotes, reproductive incompatibility can stem from changes in very few genes
Although no single threshold delineates species, eukaryotic populations with >1% genome-wide sequence divergence are likely separate species, whereas prokaryotic populations with 1% divergence are still able to recombine and thus can be considered the same species.
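For reference, a minimal sketch of how percent sequence divergence between two already-aligned sequences can be computed. This is not the paper's eukANI pipeline; the gap handling (gapped columns skipped) and the toy sequences are assumptions made for illustration.

```python
def percent_divergence(aligned_a, aligned_b):
    """Percent of mismatching columns between two aligned sequences, ignoring gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gapped columns are skipped, not counted
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * mismatches / compared


# Hypothetical 200-base "ortholog" with two substitutions introduced.
seq_a = "ACGT" * 50
seq_b = seq_a[:10] + "T" + seq_a[11:120] + "C" + seq_a[121:]

print(percent_divergence(seq_a, seq_b))  # 2 mismatches over 200 columns -> 1.0
```

With real data this would be run per single-copy ortholog and then summarized across genes. How to weight genes of different lengths is a separate choice that this sketch does not make.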
-
Measuring the sequence divergence (eukANI) between 173 pairs of sister species representing 65 orders of eukaryotes, we find that the degree of sequence similarity between species varies considerably across taxonomic groups and is not consistent for species within a genus
Does sister species = same genus?
-
67 eukaryotic “odb10” datasets from the Benchmarking Universal Single-Copy Orthologs (BUSCO) website (busco.ezlab.org), which specifies the SCOs common to members of selected taxonomic group
On average, how many SCOs are common to each taxonomic group? Is that a large enough number (more than tens?) to estimate variation like this?
-
In prokaryotes, homologous recombination, the basis of gene flow, depends directly on the degree of genomic sequence divergence, whereas in sexually reproducing eukaryotes, reproductive incompatibility can stem from changes in very few genes.
Does the data in the fig 1 change drastically if any of the BUSCO genes are involved in reproductive compatibility? Since variation in these won't be representative of the rest of the eukaryotic genome?
-
even in cases in which organisms themselves are neither in hand nor witnessed
Interesting choice of phrases: - in hand = isolated - witnessed = ? / microscopy?
These must be experimentalists writing this!
-
-
www.theiagen.com www.theiagen.com
-
See how this tool is different from seqera.
(Peter van Heusden @ Slack) The problem has been finding the right combination of platform and business model. So you've got Seven Bridges and DNANexus, but they're not playing in our world. Funding from Gates et al have turned Terra into a low cost to use platform and Theiagen has built what seems like a effective business on that. Again, platform and business model seem key. Seqera, those others I just mentioned, the platform is the business. Because of nextflow, Seqera can leverage the vast volunteer effort of the nf-core community but it still is different to the low-cost platform with paid support that Theiagen has developed (and made useable through their investment in workflows, containers, etc). I think that some of those other platforms that were discussed over on #infrastructure give potential for similar developments.
-
-
bmcbioinformatics.biomedcentral.com bmcbioinformatics.biomedcentral.com
-
Interesting paper that claims to improve 16S taxonomic classification -- on par with WGS using ML. Read more to figure out -
-
How does this work? What’s the ML magic doing here?
-
Is it really as good as WGS?:
-
Considering that the 16S region itself doesn’t have full information to resolve species..!?
-
And this is also using short reads only: 16S V3-V4
-
-
-
compared the taxonomic profiles at multiple levels derived from both 16S amplicon sequencing and WGS using an in-house produced microbiome dataset
This is short-read data from the 16S V3-V4 region, not full-length 16S. This should have been clarified in the paper!
V3-V4 hypervariable region of the 16S rRNA gene was amplified using the primers 338 F (ACTCCTACGGGAGGCAGCAG) and 806R (GGACTACHVGGGTWTCTAAT).
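The 806R primer quoted above contains degenerate bases (H, V, W). As a quick check of how many concrete oligos it stands for, here is a small sketch that expands IUPAC ambiguity codes. The function and variable names are made up for illustration; the code table is the standard IUPAC nucleotide alphabet.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_primer(primer):
    """List every concrete sequence encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]


forward_338f = "ACTCCTACGGGAGGCAGCAG"
reverse_806r = "GGACTACHVGGGTWTCTAAT"

print(len(expand_primer(forward_338f)))  # 1: no degenerate positions
print(len(expand_primer(reverse_806r)))  # 18: H (3) x V (3) x W (2)
```

So the forward primer is a single sequence, while the reverse primer is a pool of 18 variants.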
-
-
microbiomejournal.biomedcentral.com microbiomejournal.biomedcentral.com
-
Bcell and BOTU, which represent the genome-sequenced proportions of cells and taxa (at 100%, > 98.6%, or > 97% identities in the 16S-V4 region) in a specific prokaryotic biome, respectively
How are Cells and Taxa defined here?
-
the cell and taxon proportions of genome-sequenced bacteria or archaea on earth remain unknown.
They are calculating the fraction of taxa within metagenomic datasets (like earth microbiome project) that have been fully genome sequenced.
They are doing this by sequence alignment of the 16S-V4 region:
- For cells: 100% identity to a genome (?)
- For taxa: > 97% identity to {some set of genomes?}
-
we conducted a large-scale sequence alignment between the data released by the EMP and the sequenced bacterial or archaeal genomes in the public database
How is this different from taxonomic profiling that the earth microbiome project would have already done?
-