2,340 Matching Annotations
  1. Jul 2025
    1. output: path "UPPER-${input_file}"
       script: """ cat '$input_file' | tr '[a-z]' '[A-Z]' > 'UPPER-${input_file}' """

      how can I minimize the repetition in the output path name in this Nextflow process?
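      One option, assuming Nextflow DSL2: declare the output with a glob pattern so the filename is spelled out only once, in the script block. A sketch (the process name TO_UPPER is hypothetical, not from the annotated page):

```nextflow
process TO_UPPER {
    input:
    path input_file

    output:
    path "UPPER-*"      // glob: matches whatever the script writes

    script:
    """
    cat '$input_file' | tr '[a-z]' '[A-Z]' > 'UPPER-${input_file}'
    """
}
```

      The trade-off is that the glob is less explicit than a fully spelled-out name, so a script bug that writes the wrong filename would go unnoticed by the output declaration.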

    2. things could get a little tricky, because we need to be able to handle an arbitrary number of input files. Specifically, we can't write the command up front, so we need to tell Nextflow how to compose it at runtime based on what inputs flow into the process.
    1. write another pipeline that calls on one of those processes, you just need to type one short import statement to use the relevant module. This is better than just copy-pasting the code, because if later you decide to improve the module, all your pipelines will inherit the improvements.
    1. They encapsulate applications and dependencies in portable, self-contained packages that can be easily distributed. Containers are also key to enabling predictable and reproducible results.
    2. Nextflow was one of the first workflow technologies to fully embrace containers for data analysis pipelines.

      as opposed to using conda as much as possible before containerization?

    3. Today, workflows may comprise dozens of distinct container images. Pipeline developers must manage and maintain these containers and ensure that their functionality precisely aligns with the requirements of every pipeline task.
    4. Wave — a container provisioning and augmentation service that is fully integrated with the Nextflow and Nextflow Tower ecosystems.
    1. Our platform combines novel hardware with AI-enabled bioinformatics to unlock the personalized medicine potential of the gut microbiome

      I wonder what bioinformatics they are doing that could be useful for omi

    1. Specifying the Conda environments in a separate configuration profile is therefore recommended to allow the execution via a command line option and to enhance the workflow portability
    2. process.conda = 'samtools'

      does this mean all tools / processes using conda need to be pre-specified in the conda.profile? seems dumb..
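      For reference, the usual pattern (a sketch; the environment file path is hypothetical) is to keep the Conda setting in a profile in nextflow.config, so it applies only when running with `-profile conda` rather than being baked into every process:

```nextflow
profiles {
    conda {
        process.conda = "${projectDir}/environment.yml"
    }
}
```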

    3. You can also download Conda lock files from Wave build pages. These files list every package and its dependencies, so Conda doesn’t need to resolve the environment. This makes environment setup faster and more reproducible.
    1. The only difference when compared with legacy syntax is that the process is not bound with specific input and output channels, as was previously required using the from and into keywords respectively
    2. Another exciting feature of Nextflow DSL 2 is the ability to compose built-in operators, pipeline processes and sub-workflows with the pipe (|) operator
    1. (Table 3) Might be relevant to decontamination approaches suggested in the SOMATEM pipeline?


    2. due to their inherent complexity and the limited availability of decontamination pipelines compared with those for marker gene datasets
    1. Please only use Conda as a last resort i.e. when it’s not possible to run the pipeline with Docker, Singularity, Podman, Shifter, Charliecloud, or Apptainer.

      Why?: seqera AI chat says nf-core has reproducibility issues with conda: run-time resolution causing hash verification failures. Here's the summary -


      When Conda is Still Appropriate

      Conda remains useful for:

      • Development and prototyping: when you need flexibility to update packages
      • Custom/proprietary software: when containers aren't available
      • Resource-constrained environments: where container overhead is problematic
      • Legacy systems: where container runtimes aren't available

      Best Practice Recommendation

      For production workflows, the recommended approach is:

      • Primary: use Docker/Singularity containers
      • Development: use Wave to generate containers from conda specs
      • Fallback: use conda only when containers aren't feasible
      • Future: leverage conda lock files for maximum reproducibility

      The "last resort" recommendation reflects the hard-learned lessons from managing nearly 1,500 nf-core modules and the practical challenges of maintaining reproducible bioinformatics workflows at scale.

    1. Process directives allow the specification of settings for the task execution such as cpus, memory, container, and other resources in the workflow script.
    1. Seqera Platform access token is not mandatory, but it is recommended in order to access private container repositories and pull public containers without being affected by service rate limits

      Usage limits here

    1. by not conflating these two steps, FastGA can be used for other downstream tasks such as finding recurrent insertions due to transposable elements
    2. The key idea is to reduce the number of k-mers inspected for seed matches by using only those that are minimizers in a window of some small size
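      A toy Python sketch of the minimizer idea (illustrative only, not the paper's implementation): keep just the lexicographically smallest k-mer in each window of w consecutive k-mers, which shrinks the set of seeds inspected:

```python
def kmers(seq, k):
    """All contiguous k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def minimizers(seq, k, w):
    """(position, k-mer) pairs that are the lexicographically smallest
    k-mer in at least one window of w consecutive k-mers."""
    km = kmers(seq, k)
    chosen = set()
    for start in range(len(km) - w + 1):
        window = km[start:start + w]
        offset = min(range(w), key=lambda i: window[i])
        chosen.add((start + offset, km[start + offset]))
    return chosen
```

      Because adjacent windows usually share the same minimum, far fewer distinct k-mers survive than the full k-mer set.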
    1. You can define a command stub, which replaces the actual process command when the -stub-run or -stub command-line option is enabled:
    1. In this Review, we examine the main machine learning concepts, tasks and applications that are relevant for experimental and clinical microbiologists
    1. We then built the database using recent versions of NCBI RefSeq: version 221 for both bacteria (329,194 assemblies) and archaea (1,911) and version 222 for fungi (564)

      https://zenodo.org/records/10802546

      Need to create a reproducible process/script to update the database with newer versions of NCBI RefSeq!

    2. we introduce Lemur and Magnet, a pair of tools optimized for lightweight and accurate taxonomic profiling for long-read shotgun metagenomic datasets

      What makes this long-read compatible? The EM (expectation maximization) technique similar to Emu?

    1. BugBuster is a fully containerized, modular, and reproducible workflow implemented in Nextflow. The pipeline streamlines analysis at level of reads, contigs, and metagenome-assembled genomes (MAGs), offering dedicated modules for taxonomic profiling and resistome characterization.
    2. Thanks to the use of containers, BugBuster can be deployed with minimal configuration on workstations, high-performance clusters, or cloud platforms

      Does this really require containers or can be done with conda as well?

    1. an update to GTDB-Tk that uses a divide-and-conquer approach where user genomes are initially placed into a bacterial reference tree with family-level representatives followed by placement into an appropriate class-level subtree comprising species representatives
    1. Genome Taxonomy Database (GTDB; https://gtdb.ecogenomic.org) provides a phylogenetically consistent and rank normalized genome-based taxonomy for prokaryotic genomes sourced from the NCBI Assembly database.
    2. GTDB uses relative evolutionary divergence (RED) to delineate higher-rank taxa and average nucleotide identity (ANI) to delineate species clusters
    1. unlike the linear number of k-mers in a sequence, the number of subsequences grows exponentially

      What is k-mer vs subsequence difference?

      (duckduckgo-AI generated) A k-mer is a specific type of subsequence that consists of a fixed length (k) of nucleotides from a biological sequence, while a subsequence can be any sequence derived from another sequence by deleting some elements without changing the order of the remaining elements. In bioinformatics, k-mers are often used for tasks like DNA sequence assembly and analysis.
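      The difference in growth rates is easy to see with a toy count (illustrative Python, not from the paper): contiguous k-mers are linear in sequence length, while subsequences are exponential:

```python
def num_kmers(seq, k):
    # k-mers are contiguous substrings of length k: at most len(seq) - k + 1
    return max(len(seq) - k + 1, 0)

def num_subsequences(seq):
    # every position is independently kept or dropped (order preserved),
    # giving 2^n index subsets, including the empty one
    return 2 ** len(seq)
```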

    1. best practices guidelines enforced by the project further ensure that the pipelines are robust, well-documented, and validated against real-world datasets.

      something to learn from and emulate?

    1. potential limitation of shotgun sequencing is the complexity of bioinformatics pipelines required for its analysis

      This is a great statement for making the Somatem pipeline more accessible

    2. Unexpectedly, k-mer approaches resulted in rather high false positive rates, which may lead to misinterpretations of microbial community composition.

      Could this be improved with tool choices and databases? - There are more recent tools than Kraken2-bracken and sourmash for this. - Centrifuger is a recent tool; and sylph is known to be more stringent / have fewer false positives

    1. nf-core project enforces strong guidelines for how pipelines are structured, and how the code is organized, configured and documented.
    1. It may seem like a lot of work to accomplish the same result as the original pipeline, but you do get all those lovely reports generated automatically
    1. unstructured data does not have a predefined data model, it is not easily processed and analyzed through conventional data tools and methods. It is best managed in nonrelational or NoSQL databases or in data lakes, which are designed to handle massive amounts of raw data in any format.
    1. language server parses scripts and config files according to the Nextflow language specification, which is more strict than the Nextflow CLI
    2. Include declarations in scripts and config files act as links, and ctrl-clicking them opens the corresponding script or config file.
    1. There are three most popular pipelines used for NGS analyses: QIIME, mothur and MetAMOS

      Don't know if this statement is justified given that the citation counts differ by 2 orders of magnitude - Mothur: 20 K - Qiime: 37 K - MetAMOS: 230

    1. Using first-passage analysis validated by Monte Carlo simulations, we quantitatively characterize nucleotide-specific error rates during RNA polymerase II transcription

      (comments before reading in full:) Curious how you got all the rates mentioned in Fig 1C. - Appendix table S1 shows most rate constant parameters are fitted; I wonder how they were fitted - The rate constants that were fixed, did you get those from literature..?

    1. New taxa are added to the Taxonomy database as data are deposited for them.

      How often should I update this in a classifier like centrifuger?

    1. knowledge gap between microbes that are only studied en masse as communities and those select few species whose molecular, genetic, or physiological diversity is studied in detail.

      Since you are not capturing species not present in the title/abstract. You would also not capture a future study that employs automated robotics to study multiple organism like you mentioned in the previous paragraph!

    2. apply these tools to the myriad species that live in the understudied corners of our world.

      Roboticizing microbiology involves moving parts (shaking cultures), changing temperatures etc. and it will be harder to automate the study of understudied species with finicky behaviours. For example, certain streptomyces species (roseosporus) form aggregates if not grown with the proper shaking in a bevelled flask within viscous media with glass beads put in.

      How on earth do you automate your way out when you cannot standardize culture conditions for finicky organisms?

    3. Statisticians have taught for decades that the most efficient and robust experimental designs vary multiple factors simultaneously and then deconvolve the effects and interactions with simple statistical models

      This will be a nightmare in biology with low sample sizes and limited data. You will need a lot more depth of data to de-convolve factors efficiently even with the newer AI methods

    4. counted the number of PubMed articles that refer to each species in their title or abstract

      What about microbes mentioned in the body of the paper or even tables of supplementary material etc. Are these not significant enough to count as "understanding" these microbes?

      New AI based methods would make it possible to scrape such references given contextual keywords etc. that discriminate between casual references vs emphasis enough that the microbe is being "studied".

      Also, what does it mean for a microbe to be "understood" anyways? Do these all qualify, and at the same magnitude?
      1. Microbiology (culture methods, media, growth rate calculations)
      2. Synthetic bio (figuring out regulatory elements.. promoters, RBS and such that enable expressing genes on plasmids or chromosomal integration)
      3. Bioinformatic explorations involving function (insights from meta-transcriptomic studies)

    1. few sips can help lower your overall body temperature, mimicking its natural decline before you sleep

      How does this compare with drinking hot milk, which is advised by some sources?

    1. Notes from Todd:

      Huge caveat that the study assumes plasmids are all detected across variable sequencing depths, and it's impossible to speak to copy numbers across varying sequencing technologies

      Interesting at the exploratory level but just scratching the surface and may be biased due to the biased nature of the samples in the SRA

    1. Besides the review itself, it's a nice organization of longitudinal data, so can be useful when looking for datasets (Nick Sapoval)

    1. A k-mer based taxonomic classification tool ; Much smaller database size than Kraken. Uses compressed k-mer indexing using BWT compression and FM index

      • (compression) Uses only unique portions of new genomes to reduce redundancy in the index. - Fig 1

      • FM-index provides a means to exploit both large and small k-mer matches by enabling rapid search of k-mers of any length

      • Centrifuge can assign a sequence to multiple taxonomic categories


      Centrifuger is a more compressed version? what are the trade offs of this?

      In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression

    1. This is very neat. This could be a good complex dataset with known ground truth to benchmark tools for strain resolution / time-course tracking / HGT tracking methods like rhea

      I would re-create the bulk sequencing by truncating the droplet-specific barcode and collecting all the sequences together; I wish they had parallelly sequenced the full sample without the droplets for this purpose though..

    2. enables us to follow the relative abundances of these strains over time in the human donor

      Isn't there a cheaper way to track strains without needing single-cell sequenced genomes?

    1. For each microbiome sample, its MNS was derived by searching its sequence against those of all samples produced by past studies

      What does a "sample" mean: - A metagenome - a collection of sequences in a microbiome - a single sequence from a microbiome collection?

    1. Pairs of short reads with small edit distances, along with their unique molecular identifier tags, have been exploited to correct sequencing errors in both reads and tags.

      nice summary of UMI working principle

    1. First, we create a symbolic link to the unit file in the /etc/systemd/system directory.

      Symlinking has an issue where the service can fail to load on startup. Copying the file is better (Windsurf AI)

      This is a common issue with systemd services that are symlinked from a user's home directory. The problem occurs because the home directory isn't mounted when systemd tries to read the service file during early boot. Here's how to fix it: 1. Copy the service file instead of symlinking it:

    1. It’s easier than teaching kids, and it’s more exciting in some ways,” said an AI trainer

      Is it more satisfying than teaching humans though?

    1. Minimap2 is a new paradigm in mapping and by extension pairwise alignment. Uses concepts from full-genome aligners (seed-chain-align) and works for short, long reads (noisy) and RNA-seq as well. - Uses: read mapper, long-read overlapper, full-genome aligner

      capability of minimap2 comes from a fast base-level alignment algorithm and an accurate chaining algorithm..

      Minimap2 indexes reference k-mers with a hash table

    1. feature extraction attempts to reduce the dimensionality of a dataset by building a compressed representation of the input features

      Example: Go from species to higher taxa ~ genera, family..
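      The species-to-genus example can be sketched as a trivial aggregation (the counts below are hypothetical):

```python
from collections import defaultdict

def collapse_to_genus(species_counts):
    """Feature extraction by aggregation: sum species-level counts
    into genus-level features (genus = first word of the binomial)."""
    genus_counts = defaultdict(int)
    for name, count in species_counts.items():
        genus_counts[name.split()[0]] += count
    return dict(genus_counts)
```

      Two Escherichia species collapse into one Escherichia feature, shrinking the feature space at the cost of species-level signal.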

    2. Methods like t-stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) faithfully capture and reveal local and non-linear relationships in complex microbiome datasets, but their tuning is finicky
    1. Different channels can have the same package, so conda must handle these channel collisions.

      Biopython has this issue. It sometimes causes package-resolution errors since it is present in both bioconda (older versions) and conda-forge (more recent, maintained)

      Error: libmamba Could not solve for environment specs

      To solve this, put conda-forge at higher priority than bioconda
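      A minimal `.condarc` sketch of that fix (channel order is priority order, top = highest):

```yaml
channels:
  - conda-forge   # higher priority: resolved first
  - bioconda
  - defaults
channel_priority: strict
```

      With `channel_priority: strict`, a package is always taken from the highest-priority channel that provides it, which sidesteps the cross-channel collision entirely.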

    1. many environments can still not be classified well at the species level11 because of database incompleteness. Our strategy for tackling this problem was to design sylph so that researchers can create customized databases from their novel genomes or MAGs, although this requires the generation of new genomes for researchers working in undercharacterized microbiomes.

      How can users create their own customized databases? - Find any reference to this in the methods/suppl

    1. avoid rigorous environmental review

      Does damage to the environment have to be in the same bucket as not wanting any change in their backyards (NIMBY)?

    1. Advocating for shallow metagenomics for better taxonomic resolution (sub-species) compared to 16S ; this is important for low microbial density samples (skin microbiomes).

    2. 16S amplicon sequencing exhibited extreme bias toward the most abundant taxon

      I assume the PCR step causes most of the issue, and qPCR with species-specific primers doesn't reproduce this since there is no competition?

  2. Jun 2025


    1. especially interested in candidates with prior wet lab experience and a generalist quantitative mindset.

      How do you show generalist quantitative mindset in resume? - maybe easier in cover letter?


    1. Minimap2 follows a typical seed-chain-align procedure as is used by most full-genome aligners
      • anchor = exact matches of minimizers from the query (seeds) in the reference (from database)
      • chain = sets of colinear anchors
      • align = extending the chain + filling in the gaps
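      The seed and chain steps above can be caricatured in a few lines of Python (a toy, not minimap2's actual algorithm, which uses minimizers and gap-cost-aware chaining):

```python
def find_anchors(query, ref, k):
    """Seed step: exact k-mer matches, returned as (query_pos, ref_pos) anchors."""
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)  # hash-table k-mer index
    return [(q, r)
            for q in range(len(query) - k + 1)
            for r in index.get(query[q:q + k], [])]

def best_chain_length(anchors):
    """Chain step: longest set of colinear anchors (increasing in query AND ref)."""
    anchors = sorted(anchors)
    best = [1] * len(anchors)
    for i, (qi, ri) in enumerate(anchors):
        for j in range(i):
            qj, rj = anchors[j]
            if qj < qi and rj < ri:
                best[i] = max(best[i], best[j] + 1)
    return max(best, default=0)
```

      The align step would then extend the winning chain with base-level dynamic programming to fill the gaps between anchors.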
    1. Myers-Briggs Type Indicator

      May benefit from this during interviews. Getting to know yourself better. strengths, weaknesses

      Email Raylea for the code. The other 2 (Strong and Focus2) are more for undergrads

    1. We extracted 202 question-answer pairs from the KB and 39 questions generated by GPT-4 for training and testing purposes

      Isn't it weird to train and test on the same questions?

    1. capability to create mirror life is likely at least a decade away and would require large investments and major technical advances

      I believe, making these self-replicating is a long way off. Will need all the machinery including polymerases, ribosomes etc. as well as a way to make the necessary monomers from available forms in the environment

    1. Nice guideline document for thinking about contamination in low biomass samples and wet-lab + computational approaches of dealing with it

    1. This method merges an embedding based protein homolog search with a genomic context similarity. This needs a multi-modal LM including aa and DNA seqs. - Genomic context examples: CRISPR/defense islands.

      Modalities of protein homolog (sequence similarity) search:
      1. Amino acid sequence based: BLAST, HMMER
      2. Embedding based search: using ESM2 embeddings
      3. Structural search, using AlphaFold structures
      But all of these lack the extra boost provided by adding in genomic context, which currently is only done manually!

    1. If you make a 1 byte change and push the file again, you'll use another 500 MB of storage and no bandwidth

      this seems insane; what if I don't want to track the versions of this large file and only keep the final versions? - There should be some option to just change the link to the latest version and dump the old version without using the bandwidth during download as well - See the latest version of git-lfs for info on this

    1. A higher resolution but still quick method to compare multiple genomes (unassembled also). Uses a full k-mer spectra instead of minHash methods

      Unlike MinHash-based methods that produce distances and have lower resolution, KPop is able to accurately map sequences onto a low-dimensional space.

      Questions: (before reading paper..)

      • By unassembled genomes, do you mean contigs?

      • How does this k-mer spectra make it higher resolution than minHash?

      • Does dataset dependence of these transformation make this a hurdle in some way?

    2. KPop, a novel versatile method based on full k-mer spectra and dataset-specific transformations, through which thousands of assembled or unassembled microbial genomes can be quickly compared

      Does dataset dependence of these transformation make this a hurdle in some way?

    3. simplified signatures (“sketches”) based on some dataset-independent choices

      Does dataset independence make these tools better than current one in some way?

    4. KPop is able to accurately map sequences onto a low-dimensional space

      The claim is that this is higher resolution than minHash methods like mash?

    1. Cool study that re-queries a wide range of metagenomic data to raise some new thoughts on phage host range questions

      Read later to clarify thoughts in the hypothesis comments

    2. we observed surprising cases of viruses targeted by microbes not expected to be viable hosts.

      Interesting, need to read to find out why you won't expect something to be viable host. Is it mismatched environmental source of the phage vs the host / phylogenetic mismatch between expected host of the phage and spacer source?

    1. Previous efforts to design enzymes have largely focused on finding geometric matches between model active sites and preexisting protein structures, an approach akin to buying a suit from a thrift store; it is unlikely the fit will be perfect.

      Great analogy!🤣

    1. Unlike hard links, which point directly to the file data on the disk, symlinks are independent files that contain a path to another file or directory

      Hard link vs soft link

      I'm curious how a hard link would operate when synced to another computer via git/cloud drives. In my experience, I found that a hard link I made in Windows broke when I used rclone sync with OneDrive onto a Linux PC
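      A quick POSIX-only Python experiment makes the difference concrete (hard links are an inode-level construct, which is exactly the thing cloud-drive sync tools can't carry across filesystems, and likely why the rclone sync broke it):

```python
import os
import tempfile

def link_demo():
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "data.txt")
        with open(target, "w") as f:
            f.write("hello")

        hard = os.path.join(d, "hard.txt")
        soft = os.path.join(d, "soft.txt")
        os.link(target, hard)     # hard link: a second name for the same inode
        os.symlink(target, soft)  # symlink: a tiny file holding a path string

        same_inode = os.stat(hard).st_ino == os.stat(target).st_ino
        os.remove(target)         # delete the original name
        hard_survives = open(hard).read() == "hello"
        soft_dangles = os.path.islink(soft) and not os.path.exists(soft)
        return same_inode, hard_survives, soft_dangles
```

      After removing the original, the hard link still reaches the data while the symlink dangles, since it only stored the now-dead path.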

    2. Git treats symbolic links as special files that store the path to the target file. When you add a symbolic link to a Git repository, Git records the link information rather than the contents of the target file. Here’s how Git handles symbolic links during various operations:

      is this only for softlinks?

    1. Summary: Uses short-reads from metagenomes to give strain level composition. employs tree-based k-mer indexing. Briefly they do: s1. cluster similar strains + tree index for searching, s2. generate strain specific k-mers (collinear blocks within same cluster) > build a matrix.

      employs a novel tree-based k-mers indexing structure to strike a balance between the strain identification accuracy and the computational complexity…

      By searching strains inside the identified clusters, StrainScan achieves a higher resolution than cluster-level tools such as StrainGE and StrainEst

      Note: Also contrast with a newer Strainify tool?

    1. Macromolecular binding pockets, on the other hand, are located on the protein surface and are often shallower

      protein-protein interactions?

    1. taking full advantage of our algorithm might involve coordination between multiple colleagues in a lab who are constructing plasmids with different expected sequences.

      This is something a local core like GCEC can help with

    2. it could be further reduced by executing time-consuming dynamic programming only for some query-reference pairs that necessitate high levels of accuracy and by introducing parallel computing

      Nice, Any other ideas to reduce RAM use?

    3. theoretical minimum number of reads that is required for the reliable consensus calculation is 30 reads per plasmid

      Does this depend on the plasmid length and the preparation kit before sequencing that determines fragmentation?

    1. Please only use Conda as a last resort, i.e., when it’s not possible to run the pipeline with Docker or Singularity.

      Why is conda not recommended?

    1. To enable data augmentations and stitching of multiple contigs together, we introduce two special tokens. The ‘#’ token is used to join sequences from the same species with uncertain distance to each other, while the ‘@’ token is used for sequences that are from the same contig/strand and are near each other.

      Do you ensure that the stitched contigs are in the same order within the chromosome - Is this better than using an assembly tool?

      How are these delimiters # and @ processed at the output stage? - If these delimiters are de-emphasized during the calculations, would this promote evo2 to learn a false sense of continuity between contigs that are not connected within the actual genome?

    2. Evo 2 can also leverage its unique representation of biological complexity to generate new genomic sequences

      What is the point in generating genomic sequences with some vague notion such as "naturalness"? - Assuming future adaptations would include prompting to generate specific sequence features; maybe it makes more sense in this context?

    3. previously demonstrated that machine learning models trained on prokaryotic genomic sequences can model the function of DNA, RNA, and proteins

      Elaborate "Can model the function"

    1. VOGDB, which is a database of virus orthologous groups. VOGDB is a multi-layer database that progressively groups viral genes into groups connected by increasingly remote similarity

      Layers: 1. pair-wise sequence similarity 2. sequence profile alignment 3. predicted protein structures

      The first layer is based on pair-wise sequence similarities, the second layer is based on the sequence profile alignments, and the third layer uses predicted protein structures to find the most remote similarity

    1. Specific gut species distinguish left-sided versus right-sided CRC (area under the curve = 0.66) with an enrichment of oral-typical microbes

      It is very surprising that the left and right sides of colorectal cancers have heterogeneity!

    1. The Trump administration is preparing to cancel a large swath of federal funding for California

      How do you prevent such partisan and vindictive actions by federal government on states?

      Same thing is happening in India - Is there any framework people have seen in a more federated country, maybe Germany?

    1. or methodological differences in screening algorithms.

      This could have been elaborated a bit more. It is too generic and rather obvious

      TODO: try bringing this up to Todd on Slack..

    1. Interesting study that expands the similarity metric used to mark core-genes that determine clade membership and phylogeny by their homology. They sub-sample the similarity problem by predicting 3Di structural strings as opposed to full structure prediction

      • To identify core-genes, we traditionally use amino acid similarity (better than nucleotide.. codon usage differences)
      • Going one step ahead, we can use protein structures/folds to generalize this further for deep clades where amino acid homology is quite low.

      Read more to see how they implement this, and how robust the homology is when inferred via an approximate subsampling-like scheme onto 3Di structural strings generated from amino acids (as in AlphaFold)

    1. I increasingly use the Nim programming language for data processing tasks. Nim is under-appreciated in computational science but it is a very capable Python replacement for non-numerical data processing. At a high level, Nim is as easy to write as Python and as fast as C

      nim = Interesting cool and fast python like programming language

      How does this compare to Julia?

    1. It’s really helpful to be able to dictate my academic papers using my phone when inspiration hits me, wherever that may be.

      Audio transcription is available with a plug in?

    1. removed potential virulence genes and secretion systems (T3SS and T6SS) to ensure safety

      Would keeping the secretion system enable easier protein purification through secretion tags?

    1. BioReason reasons over unseen biological entities and articulates decision-making through interpretable, step-by-step biological traces, offering a transformative approach for AI in biology that enables deeper mechanistic insights

      Text in and text out: - Answers questions like

      What are biological effects of this mutation and what disease it causes - (Summarized the example from fig 1)

    1. Designs primers for clade specific (pan-specific) viral qPCR, single or tiled amplicon sequencing

      • Can include degeneracy..

      • How is single different from qPCR? only amplicon length maybe?

      • Is it really better than Olivar? Why?

      • Can this make the MSA as well? That’s too specialized - there are already CLI tools out there

    1. Extremophile Campaign: In Your Home (ECIYH)

      Really cool citizen science campaign!

      I'm curious who are the scientists behind this. Find and document here/in a page note

    2. really cold places like near your air conditioner drip tray

      Is the drip tray really that cold? Would be nice to get a temperature measurement as well!

    1. identified 90% more genus-level phage-host interactions than traditional assembly-based methods

      how do you know their ground truth / if these are false calls?

    1. While some tech giants neared or imposed widespread layoffs last year, compensation for their CEOs climbed as much as tens of millions of dollars, according to an ABC News analysis of data released by research firm Equilar in May and June.

      How much of this is due to automatic factors such as stock valuation gains?

    1. Metabolome analyses identify 15 mediating metabolites in pregnancy that improve ADHD prediction

      Good to add more support than self reported diet by surveys

  3. May 2025
    1. ISCB will grant remote presentation options for reasons associated to maternity/paternity leave, care for a family member, personal/medical disability, sickness, financial hardship, or potential visa problems.

      potential visa problems: argue for USA re-entry issues

    1. Interesting paper about nucleotide sequence divergence using SCO genes (single-copy ortholog). They are thinking about a threshold for species identification in Eukaryotes like we do in prokaryotes

      Neat part is they bring different taxa of wide ranging kingdoms to compare on the same plots! (don’t know if this is novel…)

      In prokaryotes, homologous recombination, the basis of gene flow, depends directly on the degree of genomic sequence divergence, whereas in sexually reproducing eukaryotes, reproductive incompatibility can stem from changes in very few genes

      Although no single threshold delineates species, eukaryotic populations with >1% genome-wide sequence divergence are likely separate species, whereas prokaryotic populations with 1% divergence are still able to recombine and thus can be considered the same species.

    2. Measuring the sequence divergence (eukANI) between 173 pairs of sister species representing 65 orders of eukaryotes, we find that the degree of sequence similarity between species varies considerably across taxonomic groups and is not consistent for species within a genus

      Does sister species = same genus?

    3. 67 eukaryotic “odb10” datasets from the Benchmarking Universal Single-Copy Orthologs (BUSCO) website (busco.ezlab.org), which specifies the SCOs common to members of selected taxonomic group

      On avg, how many SCOs are common to taxonomic groups. Is that a large enough number (> 10s?..) to estimate variation like this?

    4. In prokaryotes, homologous recombination, the basis of gene flow, depends directly on the degree of genomic sequence divergence, whereas in sexually reproducing eukaryotes, reproductive incompatibility can stem from changes in very few genes.

      Does the data in the fig 1 change drastically if any of the BUSCO genes are involved in reproductive compatibility? Since variation in these won't be representative of the rest of the eukaryotic genome?

    5. even in cases in which organisms themselves are neither in hand nor witnessed

      Interesting choice of phrases: - in hand = isolated - witnessed = ? / microscopy?

      These must be experimentalists writing this!

    1. See how this tool is different from seqera.

      (Peter van Heusden @ Slack) The problem has been finding the right combination of platform and business model. So you've got Seven Bridges and DNANexus, but they're not playing in our world. Funding from Gates et al have turned Terra into a low cost to use platform and Theiagen has built what seems like a effective business on that. Again, platform and business model seem key. Seqera, those others I just mentioned, the platform is the business. Because of nextflow, Seqera can leverage the vast volunteer effort of the nf-core community but it still is different to the low-cost platform with paid support that Theiagen has developed (and made useable through their investment in workflows, containers, etc). I think that some of those other platforms that were discussed over on #infrastructure give potential for similar developments.

    1. Interesting paper that claims to improve 16S taxonomic classification -- on par with WGS using ML. Read more to figure out -

      • How does this work? What’s the ML magic doing here?

      • Is it really as good as WGS?:

        • Considering that the 16S region itself doesn’t have full information to resolve species..!?

        • And this is also using short reads only: 16S V3,V4

    2. compared the taxonomic profiles at multiple levels derived from both 16S amplicon sequencing and WGS using an in-house produced microbiome dataset

      This is short read data of 16S V3,V4 regions. Not 16S full length // Should have been clarified in the paper!

      V3-V4 hypervariable region of the 16S rRNA gene was amplified using the primers 338 F (ACTCCTACGGGAGGCAGCAG) and 806R (GGACTACHVGGGTWTCTAAT).

    1. Bcell and BOTU, which represent the genome-sequenced proportions of cells and taxa (at 100%, > 98.6%, or > 97% identities in the 16S-V4 region) in a specific prokaryotic biome, respectively

      How are Cells and Taxa defined here?

    2. the cell and taxon proportions of genome-sequenced bacteria or archaea on earth remain unknown.

      They are calculating the fraction of taxa within metagenomic datasets (like earth microbiome project) that have been fully genome sequenced.

      They are doing this by sequence alignment of the 16S-V4 region - For cells: 100% identity to a genome - For taxa: >97% identity to {some set of genomes?}

    3. we conducted a large-scale sequence alignment between the data released by the EMP and the sequenced bacterial or archaeal genomes in the public database

      How is this different from taxonomic profiling that the earth microbiome project would have already done?