Data

https://portal.nersc.gov/MGV/

Reference, as given at https://github.com/snayfach/MGV:
Nayfach et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. 2021. https://www.nature.com/articles/s41564-021-00928-6.

Setup

1. Viral detection pipeline

Prodigal v2.6.3
- conda install bioconda::prodigal
- or
  https://github.com/hyattpd/Prodigal/wiki/installation#installing-on-mac-os-x
HMMER v3.1b2
- trying HMMER v3.4 (latest version)
  conda install bioconda::hmmer
  - ~~sadly, v3.4 works differently than v3.1b2, I’ll go back to v3.1b2 for now.. but~~
    - ~~haha, this is why I love cs, v3.1b2 doesn’t even compile~~
    - ~~going back to v3.4~~
    - ~~actually I accidentally skipped over the ‘prodigal’ step so v3.4 didn’t work at the first time~~
- v3.1b2
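
The ordering problem noted above (hmmsearch has nothing to search until Prodigal has written the protein file) can be sketched as a small shell function. This is only an illustration: the file names, the HMM database path, the `--cpu 4` value, and the `DRY_RUN` switch are my own placeholders, not part of the MGV pipeline, and the flags should be checked against your installed versions.

```shell
#!/bin/sh
# Sketch: call genes with Prodigal first, then search the proteins with HMMER.
# All file names are placeholders. Set DRY_RUN=1 to print commands only.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

call_genes_then_search() {
    fna=$1    # input contigs (nucleotide FASTA)
    hmm=$2    # HMM database (placeholder path)
    out=$3    # output prefix

    # Step 1: Prodigal writes the predicted proteins to ${out}.faa
    run prodigal -i "$fna" -a "${out}.faa" -p meta || return 1

    # Step 2: hmmsearch reads the protein file produced in step 1
    run hmmsearch --cpu 4 --noali --tblout "${out}.tblout" "$hmm" "${out}.faa"
}
```

Calling `DRY_RUN=1; call_genes_then_search contigs.fna db.hmm sample` prints the two commands in order, which is a cheap way to confirm that the Prodigal step is not being skipped.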
VirFinder v1.1
- glmnet install.packages("glmnet", dependencies=TRUE)
- Rcpp install.packages("Rcpp", dependencies=TRUE)
- qvalue
  - I tried:
    install.packages("devtools")
    library("devtools")
    install_github("jdstorey/qvalue")
- then VirFinder
  - But the code provided didn’t work for me.
  - The downloaded file won’t install directly as it is.
  - Only a few steps are needed: decompress the downloaded .zip or .tar.gz file, go to the folder that matches your OS, and run install.packages on the VirFinder package in that folder.
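
The manual install described above could also be scripted from the shell. This is a rough sketch under assumptions: the folder names (`linux`, `mac`, `windows`) and the archive name `VirFinder_1.1.tar.gz` are guesses at what the unpacked download contains, so adjust them to what you actually see. `R CMD INSTALL` is the standard way to install an R package from a local source archive.

```shell
#!/bin/sh
# Sketch: install VirFinder from the OS-specific folder of the unpacked download.
# Folder names and the archive file name are assumptions, not verified paths.

# Map the output of `uname -s` to the (assumed) folder name in the download.
os_folder() {
    case "$1" in
        Linux)  echo "linux" ;;
        Darwin) echo "mac" ;;
        *)      echo "windows" ;;
    esac
}

install_virfinder() {
    unpacked_dir=$1    # directory where the .zip / .tar.gz was extracted
    folder=$(os_folder "$(uname -s)")
    # Build and install the package from the local source archive.
    R CMD INSTALL "${unpacked_dir}/${folder}/VirFinder_1.1.tar.gz"
}
```

Usage would be something like `install_virfinder /path/to/unpacked/VirFinder`, where the path is whatever directory you extracted the download into.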

2. Quality Control

CheckV

Some users may wish to update the database using their own complete genomes:

checkv update_database /path/to/checkv-db /path/to/updated-checkv-db genomes.fna

There are two ways to run CheckV:

Using a single command to run the full pipeline (recommended):
checkv end_to_end input_file.fna output_directory -t 16

Using individual commands for each step in the pipeline:

checkv contamination input_file.fna output_directory -t 16
checkv completeness input_file.fna output_directory -t 16
checkv complete_genomes input_file.fna output_directory
checkv quality_summary input_file.fna output_directory

Please refer to the Berkeley Lab CheckV documentation for the specific functions it provides and the meaning of each output file.
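
If you go the step-by-step route, the four individual commands listed above can be chained so that a failure stops the run early. The wrapper below is only a sketch: the function name, the default thread count, and the `DRY_RUN` switch are my own additions. CheckV also needs to know where its database is, either through the `-d` option or the `CHECKVDB` environment variable.

```shell
#!/bin/sh
# Sketch: run the four CheckV steps in order and stop at the first failure.
# Assumes the database location is already set, e.g.
#   export CHECKVDB=/path/to/checkv-db
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

checkv_steps() {
    input=$1           # input FASTA
    outdir=$2          # output directory shared by all steps
    threads=${3:-16}   # thread count for the two threaded steps

    run checkv contamination "$input" "$outdir" -t "$threads" &&
    run checkv completeness "$input" "$outdir" -t "$threads" &&
    run checkv complete_genomes "$input" "$outdir" &&
    run checkv quality_summary "$input" "$outdir"
}
```

For example, `checkv_steps input_file.fna output_directory 16` runs the same sequence as the list above.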

Installing
- conda install didn’t work for me; a few additional packages were needed:
  - rust
    conda install rust
  - diamond (version 2.1.8, because a bug showed up with 2.1.9 and the newest version, 2.1.12, is also not compatible, so I went with the version known to work)
    conda install bioconda::diamond=2.1.8
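
Since the DIAMOND version matters here, it may be worth confirming which one actually ends up on the PATH after installing. The sketch below assumes `diamond --version` prints a line ending in the version number (for example `diamond version 2.1.8`); the helper names are my own.

```shell
#!/bin/sh
# Sketch: confirm that the pinned DIAMOND version is the one on PATH.
# Assumes `diamond --version` prints a line whose last field is the version.

# Extract the last whitespace-separated field of a version line.
version_of() {
    echo "$1" | awk '{print $NF}'
}

check_diamond() {
    wanted=${1:-2.1.8}
    found=$(version_of "$(diamond --version 2>/dev/null)")
    if [ "$found" = "$wanted" ]; then
        echo "diamond $found OK"
    else
        echo "diamond version is '${found}', expected ${wanted}" >&2
        return 1
    fi
}
```

Running `check_diamond 2.1.8` before starting CheckV catches the case where a different environment's DIAMOND is picked up first.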

3. Cluster genomes based on ANI

conda install bioconda::blast

Other than this, just follow along with the manual. I needed BLAST only because it was missing on my machine; if you already have it, you can skip this installation.
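
To find out whether an installation like this is needed at all, a quick check of what is already on the PATH helps. This is a minimal sketch; `missing_tools` is a helper name I made up, and `makeblastdb` and `blastn` are the BLAST+ programs I would expect this step to call.

```shell
#!/bin/sh
# Sketch: print the names of any tools that are not found on PATH.

missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

`missing_tools makeblastdb blastn` prints nothing when both are installed, and prints the name of each program that still needs installing otherwise.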

4. Cluster genomes based on AAI

conda install bioconda::mcl

Stijn van Dongen, Graph Clustering Via a Discrete Uncoupling Process,
SIAM Journal on Matrix Analysis and Applications, 30(1):121-141, 2008.
( http://link.aip.org/link/?SJMAEL/30/121/1 )
As above, just follow the manual if you are not missing anything.
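
To check that mcl works after installing it, a toy run is enough. The sketch below writes a tiny, invented edge list in the label format that `--abc` expects (one `nodeA nodeB weight` triple per line) and clusters it. The node names, weights, and the inflation value `-I 2.0` are illustrative assumptions, not values taken from the MGV manual.

```shell
#!/bin/sh
# Sketch: cluster a tiny, made-up edge list with MCL to test the installation.

# Write four tab-separated edges forming two groups: {g1,g2,g3} and {g4,g5}.
write_toy_edges() {
    printf '%s\t%s\t%s\n' \
        g1 g2 0.9 \
        g2 g3 0.8 \
        g1 g3 0.85 \
        g4 g5 0.95 > "$1"
}

cluster_edges() {
    edges=$1   # input edge list in --abc format
    out=$2     # output file, one cluster per line
    if command -v mcl >/dev/null 2>&1; then
        # -I sets the inflation parameter, which controls cluster granularity.
        mcl "$edges" --abc -I 2.0 -o "$out"
    else
        echo "mcl not found on PATH" >&2
        return 1
    fi
}
```

For example, `write_toy_edges toy.abc` followed by `cluster_edges toy.abc toy_clusters.txt` should leave a cluster file to inspect if the installation is working.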

5. Create SNP phylogenetic trees

Packages
- MUMmer v4.0.0beta2
- FastTreeMP v2.1.10
- Newer versions of both are fine

6. Marker tree

It was at this point that I realized the protein data is really all I need. Some of the earlier clusters are helpful, but what is actually required, including for the later Foldseek part, is the raw protein data.
Pipeline
- Note that following the previous steps does not necessarily produce several representative .fna files from which to build an SNP tree.

MGV Connector