Data

https://portal.nersc.gov/MGV/

Reference (from https://github.com/snayfach/MGV):
Nayfach et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nature Microbiology, 2021. https://www.nature.com/articles/s41564-021-00928-6

Setup

1. Viral detection pipeline

  • Prodigal v2.6.3

  • HMMER v3.1b2

    • Trying HMMER v3.4 (the latest version)
      conda install bioconda::hmmer
      • At first v3.4 seemed to behave differently from v3.1b2, so I planned to go back to v3.1b2 for now. But:
        • Haha, this is why I love CS: v3.1b2 doesn’t even compile.
        • So I went back to v3.4.
        • It turns out I had accidentally skipped the ‘prodigal’ step, which is why v3.4 didn’t work the first time.
    • v3.1b2
  • VirFinder v1.1

    • glmnet: install.packages("glmnet", dependencies=TRUE)
    • Rcpp: install.packages("Rcpp", dependencies=TRUE)
    • qvalue
      • I tried:
        install.packages("devtools")
        library("devtools")
        install_github("jdstorey/qvalue")
    • then VirFinder
      • But the install code provided didn’t work for me.
      • The downloaded file can’t be installed directly as it is.
      • Only a few steps are needed: decompress the downloaded .zip or .tar.gz file, go to the folder that matches your OS, and run install.packages on the VirFinder package in that folder.
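The ordering mistake noted above (running HMMER before Prodigal) is easy to guard against with a small wrapper. This is a minimal sketch, not MGV's own script: the file names contigs.fna and viral.hmm are hypothetical placeholders, and the flags are standard Prodigal and HMMER options, not necessarily the exact settings the MGV pipeline uses.

```shell
#!/usr/bin/env bash
# Sketch: call genes with Prodigal first, then search the proteins with HMMER.
# contigs.fna, viral.hmm and the output prefix are hypothetical placeholders.

# Print the name of every requested tool that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

call_genes_then_search() {
  local contigs=$1 hmm_db=$2 out_prefix=$3
  local missing
  missing=$(missing_tools prodigal hmmsearch)
  if [ -n "$missing" ]; then
    echo "missing tools: $missing" >&2
    return 1
  fi
  # Step 1: predict proteins (-p meta is Prodigal's metagenomic mode).
  prodigal -i "$contigs" -a "$out_prefix.faa" -p meta >/dev/null || return 1
  # Step 2: hmmsearch takes the protein file from step 1, not the raw contigs.
  hmmsearch --noali --tblout "$out_prefix.hmmsearch.tbl" "$hmm_db" "$out_prefix.faa" >/dev/null
}

# Usage (only does real work when both tools are installed):
# call_genes_then_search contigs.fna viral.hmm output/sample
```

Because the second step reads the .faa file written by the first, skipping Prodigal fails immediately instead of producing a confusing HMMER error.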

2. Quality Control

  • CheckV

    • Some users may wish to update the database using their own complete genomes:
      checkv update_database /path/to/checkv-db /path/to/updated-checkv-db genomes.fna
    • There are two ways to run CheckV:
      • Using a single command to run the full pipeline (recommended):
        checkv end_to_end input_file.fna output_directory -t 16
      • Using individual commands for each step in the pipeline:
        checkv contamination input_file.fna output_directory -t 16
        checkv completeness input_file.fna output_directory -t 16
        checkv complete_genomes input_file.fna output_directory
        checkv quality_summary input_file.fna output_directory
    • Please refer to the Berkeley Lab CheckV documentation for the specific functions it offers and the meaning of its output.
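After a run finishes, a quick tally of contigs per quality tier is a handy sanity check. The sketch below assumes the output directory contains a tab-separated quality_summary.tsv with a header column named checkv_quality. It finds that column by name instead of hard-coding its position; the function name is my own.

```shell
# Count contigs per CheckV quality tier in a quality_summary.tsv.
# Assumes a tab-separated file whose header has a "checkv_quality" column.
tally_checkv_quality() {
  awk -F'\t' '
    NR == 1 {
      # Locate the column by its header name.
      for (i = 1; i <= NF; i++) if ($i == "checkv_quality") col = i
      if (!col) { print "no checkv_quality column found" > "/dev/stderr"; exit 1 }
      next
    }
    { count[$col]++ }
    END { for (q in count) print q "\t" count[q] }
  ' "$1" | sort
}

# Usage:
# tally_checkv_quality output_directory/quality_summary.tsv
```

Each output line is a tier name and its contig count, separated by a tab and sorted by tier name.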
  • Installing

    • conda install didn’t work for me out of the box; I needed a few additional packages
      • rust
        conda install rust
      • diamond (I used version 2.1.8 because a bug showed up with 2.1.9, and the newest version, 2.1.12, is also not compatible, so I went with the version known to work)
        conda install bioconda::diamond=2.1.8
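Since the working setup depends on one specific diamond release, it can help to confirm which version is on PATH before running CheckV. This is a minimal sketch: the function name is my own, and it assumes `diamond version` prints a line ending in the version number (for example "diamond version 2.1.8").

```shell
# Succeed only if the diamond on PATH reports the wanted version.
# Assumption: `diamond version` prints a line whose last field is the version.
check_diamond_version() {
  local want=$1 have
  command -v diamond >/dev/null 2>&1 || { echo "diamond not found" >&2; return 2; }
  have=$(diamond version 2>/dev/null | awk '{print $NF}')
  [ "$have" = "$want" ]
}

# Usage:
# check_diamond_version 2.1.8 || echo "diamond is missing or not 2.1.8"
```

The function returns 2 when diamond is absent and 1 when the version differs, so the two cases can be told apart if needed.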

3. Cluster genomes based on ANI

conda install bioconda::blast
  • Other than this, just follow the manual. I installed BLAST only because it was missing on my machine; you can skip this installation if you already have it.

4. Cluster genomes based on AAI

conda install bioconda::mcl
  • Stijn van Dongen, Graph Clustering Via a Discrete Uncoupling Process,
    SIAM Journal on Matrix Analysis and Applications, 30(1):121-141, 2008.
    ( http://link.aip.org/link/?SJMAEL/30/121/1 )
  • Same here: just follow the manual if you are not missing anything.

5. Create SNP phylogenetic trees

  • Package
    • MUMmer v4.0.0beta2
    • FastTreeMP v2.1.10
    • newer versions are fine

6. Marker tree

  • It was at this moment that I knew: protein data is all I need. Some of the previous clusters are helpful too, but what’s really needed, also for the later foldseek part, is the raw protein data.

  • Pipeline

    • Note that following the previous steps doesn’t necessarily give us several representative .fna files from which to create an SNP tree.

step