MGV Connector
Data
Referenced from https://github.com/snayfach/MGV:
Nayfach et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. 2021. https://www.nature.com/articles/s41564-021-00928-6.
Setup
1. Viral detection pipeline
- Prodigal
  conda install bioconda::prodigal
  or see https://github.com/hyattpd/Prodigal/wiki/installation#installing-on-mac-os-x
HMMER v3.1b2
- trying HMMER v3.4 (latest version)
  conda install bioconda::hmmer
  Sadly, v3.4 seemed to work differently from v3.1b2, so I planned to go back to v3.1b2 for now. But haha, this is why I love CS: v3.1b2 doesn’t even compile, so I went back to v3.4. It turns out I had accidentally skipped the Prodigal step, which is why v3.4 didn’t work the first time.
- v3.1b2
VirFinder v1.1
- glmnet
  install.packages("glmnet", dependencies=TRUE)
- Rcpp
  install.packages("Rcpp", dependencies=TRUE)
- qvalue
- I tried:
  install.packages("devtools")
  library("devtools")
  install_github("jdstorey/qvalue")
- Then VirFinder
- The install code provided didn’t work for me.
- The downloaded file won’t install directly.
- Only a few steps are needed: decompress the downloaded zip or .tar.gz file, go to the folder that matches your OS, and run install.packages on the VirFinder package in that folder.
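The HMMER confusion above came from skipping the Prodigal step, so here is a minimal sketch of the order the two tools need to run in. The function name, the output file names, and the arguments are placeholders of mine, not taken from the MGV repository; only the `prodigal` and `hmmsearch` options shown are standard ones.

```shell
# Sketch only: function name and file names are placeholders.
# Gene calling (Prodigal) must finish before the HMM search (HMMER),
# because hmmsearch reads the protein FASTA that Prodigal writes.
call_genes_then_search() {
    fna="$1"      # input contigs (nucleotide FASTA)
    hmm_db="$2"   # HMM profile file to search against
    out_dir="$3"  # directory for intermediate and result files

    mkdir -p "$out_dir" || return 1

    # Step 1: predict proteins; -p meta selects Prodigal's metagenomic mode.
    prodigal -i "$fna" -a "$out_dir/proteins.faa" -p meta > /dev/null || return 1

    # Step 2: search the predicted proteins against the HMM profiles.
    hmmsearch --noali --tblout "$out_dir/hits.tblout" \
        "$hmm_db" "$out_dir/proteins.faa" > /dev/null || return 1
}
```

It could be called as `call_genes_then_search contigs.fna profiles.hmm results`, where all three arguments are whatever paths you actually use. If Prodigal fails, the function stops before HMMER is ever run, which would have made my mistake obvious.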
2. Quality Control
- CheckV
- Some users may wish to update the database using their own complete genomes:
checkv update_database /path/to/checkv-db /path/to/updated-checkv-db genomes.fna
- There are two ways to run CheckV:
- Using a single command to run the full pipeline (recommended):
checkv end_to_end input_file.fna output_directory -t 16
- Using individual commands for each step in the pipeline:
checkv contamination input_file.fna output_directory -t 16
checkv completeness input_file.fna output_directory -t 16
checkv complete_genomes input_file.fna output_directory
checkv quality_summary input_file.fna output_directory
- Please refer to Berkeley Lab CheckV for the specific functions it offers and the meaning of its output.
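If you go with the individual commands rather than `end_to_end`, it may help to chain them so a failed step stops the run. The following is a rough sketch using only the four subcommands listed above; the function name and the default of 16 threads are my own choices.

```shell
# Sketch only: wraps the four individual CheckV steps listed above.
# Stops at the first step that fails instead of running the rest.
run_checkv_steps() {
    input="$1"           # input FASTA, e.g. input_file.fna
    out_dir="$2"         # CheckV output directory
    threads="${3:-16}"   # thread count; 16 mirrors the commands above

    checkv contamination "$input" "$out_dir" -t "$threads" || return 1
    checkv completeness "$input" "$out_dir" -t "$threads" || return 1
    checkv complete_genomes "$input" "$out_dir" || return 1
    checkv quality_summary "$input" "$out_dir" || return 1
}
```

A call would look like `run_checkv_steps input_file.fna output_directory 16`. The steps run in the same order as in the list above, each writing to the same output directory.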
Installing
- conda install didn’t work for me; I needed a few additional packages:
- rust
  conda install rust
- diamond (I used version 2.1.8 because a bug showed up in 2.1.9 and the newest version, 2.1.12, is also not compatible, so I went with the version known to work)
  conda install bioconda::diamond=2.1.8
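Since the DIAMOND version matters here, a quick check before running CheckV can save some debugging. This is a sketch under one assumption: that `diamond version` prints a line ending in the version number (for example `diamond version 2.1.8`). The function name is hypothetical.

```shell
# Sketch only: assumes `diamond version` prints a line whose last
# field is the version number, e.g. "diamond version 2.1.8".
check_diamond_version() {
    wanted="${1:-2.1.8}"   # version pinned in the notes above
    found="$(diamond version 2>/dev/null | awk '{print $NF}')"

    if [ "$found" = "$wanted" ]; then
        echo "ok: diamond $found"
    else
        echo "mismatch: found '${found:-none}', wanted $wanted"
        return 1
    fi
}
```

Running `check_diamond_version` with no argument checks against 2.1.8; pass another version string if a different one turns out to work for you.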
3. Cluster genomes based on ANI
conda install bioconda::blast
- Other than this, just follow the manual. I needed this install because BLAST was missing on my machine; you can skip it if you already have it.
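To see whether an install like this is needed at all, one option is to check which tools are already on your PATH. A small sketch follows; the function name is mine, and `blastn` and `makeblastdb` in the usage note are the standard BLAST+ executable names, which I am assuming are the ones this step needs.

```shell
# Sketch only: prints the name of every tool that is NOT on the PATH.
# Prints nothing when all of the given tools are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}
```

For example, `missing_tools blastn makeblastdb` prints whichever of the two is absent. The same check works for `mcl` in the next step.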
4. Cluster genomes based on AAI
conda install bioconda::mcl
- Stijn van Dongen, Graph Clustering Via a Discrete Uncoupling Process, SIAM Journal on Matrix Analysis and Applications, 30(1):121-141, 2008. (http://link.aip.org/link/?SJMAEL/30/121/1)
- Same as above: just follow the manual if you are not missing anything.
5. Create SNP phylogenetic trees
- Packages
- MUMmer v4.0.0beta2
- FastTreeMP v2.1.10
- newer versions are fine
6. Marker tree
It was at this moment that I knew: protein data is all I need. Some of the previous clusters are helpful too, but what’s really needed, also for the later Foldseek part, is the raw protein data.
Pipeline
- Note that following the previous steps does not necessarily produce several representative .fna files from which to build a SNP tree.