CUT&Tag Tutorial Tutorial
Preface

This is a tutorial for the CUT&Tag Tutorial, aimed at helping readers who are less familiar with computing understand and work through the CUT&Tag pipeline. Through step-by-step explanations and hands-on examples, we hope every reader can complete the analysis while becoming more comfortable with the relevant tools and the command line. Beyond the basic steps, this guide also covers common problems and practical caveats. We hope it leaves you confident enough to apply CUT&Tag in your own research.

Read this alongside the original tutorial: whenever something there is unclear, jump to the corresponding section here.
Packages used in RStudio and Conda

- R

```r
library(dplyr)
library(ggplot2)
library(ggpubr)
library(viridis)
library(GenomicRanges)
library(chromVAR)
```

- Conda

The conda packages do not need to be installed in advance; install each one when it is first needed.

Bowtie2 (version >= 2.3.4.3)
samtools (version >= 1.10)
bedtools (version >= 2.29.1)
Picard (version >= 2.18.29)
SEACR (version >= 1.3)
deepTools (version >= 2.0)
Preliminaries

$projPath

$projPath is a "shortcut" that is used constantly throughout the tutorial. As the tutorial states, projPath is the path to data and results. To avoid losing track of your data, create a dedicated directory under your home directory (~/) to hold everything:

```shell
mkdir -p ~/cuttag_project
```
Now that there is a place to store data and results, assign its path to projPath:

```shell
cd cuttag_project   # enter the directory
pwd                 # print the directory's path
projPath="/Users/tonyzh/cuttag_project"   # put the path pwd returned inside the quotes
```

Conda

If you connect to the school cluster, you only need to do the initialization step.

- Install Miniconda3
https://docs.anaconda.com/miniconda/#quick-command-line-install
The page above installs via the command line: scroll to the bottom, choose your operating system, and paste the few lines it provides into your terminal.

- Initialization

The site gives initialization commands for two shells. Run echo $SHELL to confirm which shell you use, pick the matching initialization code, then reload your shell to finish:

```shell
source ~/.zshrc    # zsh
source ~/.bashrc   # bash
```

- Install and configure Conda
```shell
conda install conda            # install conda
conda --version                # check that conda installed correctly; if this errors, redo the initialization step
conda config --show channels   # currently only the default channel; two more need to be added
conda config --append channels conda-forge   # add the conda-forge channel
conda config --append channels bioconda      # add the bioconda channel
```
Importing the data

The tutorial uses six datasets in total: H3K27me3 rep1/rep2, H3K4me3 rep1/rep2, and IgG rep1/rep2. It only provides a command-line download for IgG_rep2; the other datasets must be fetched from ENA:
https://www.ebi.ac.uk/ena/browser/home
Open the site and paste the SRA entry (SRX########) into the Enter accession box.

Scroll down to find the download section; there are two options: wget, or downloading directly in the browser.
First create a directory for each of the six datasets:

```shell
mkdir -p ~/cuttag_project/data/IgG_rep2
mkdir -p ~/cuttag_project/data/IgG_rep1
mkdir -p ~/cuttag_project/data/K27me3_rep2
...
```

- Chrome: just download directly.
- Wget: what actually happens is that you download a TSV-format download report and then wget the links it contains. To save time, I have pasted all the links below, ready to use:

```shell
wget -O $projPath/data/dataName_rep#/dataName_rep#_R#.fastq.gz ftp://#links below#
```
K27me3_rep1: SRX8754646

```
R1: ftp.sra.ebi.ac.uk/vol1/fastq/SRR122/017/SRR12246717/SRR12246717_1.fastq.gz
R2: ftp.sra.ebi.ac.uk/vol1/fastq/SRR122/017/SRR12246717/SRR12246717_2.fastq.gz
```

K27me3_rep2: SRX7713678

```
R1: ftp.sra.ebi.ac.uk/vol1/fastq/SRR110/040/SRR11074240/SRR11074240_1.fastq.gz
R2: ftp.sra.ebi.ac.uk/vol1/fastq/SRR110/040/SRR11074240/SRR11074240_2.fastq.gz
```

K4me3_rep1: SRX7713692

```
R1: ftp.sra.ebi.ac.uk/vol1/fastq/SRR110/054/SRR11074254/SRR11074254_1.fastq.gz
R2: ftp.sra.ebi.ac.uk/vol1/fastq/SRR110/054/SRR11074254/SRR11074254_2.fastq.gz
```

K4me3_rep2: SRX7713696

```
R1: ftp.sra.ebi.ac.uk/vol1/fastq/SRR110/058/SRR11074258/SRR11074258_1.fastq.gz
R2: ftp.sra.ebi.ac.uk/vol1/fastq/SRR110/058/SRR11074258/SRR11074258_2.fastq.gz
```

IgG_rep1: SRX8468909

```
R1: ftp.sra.ebi.ac.uk/vol1/fastq/SRR119/024/SRR11923224/SRR11923224_1.fastq.gz
R2: ftp.sra.ebi.ac.uk/vol1/fastq/SRR119/024/SRR11923224/SRR11923224_2.fastq.gz
```
II. Data Pre-processing
$histName
It is worth understanding histName: it plays much the same role in the tutorial as projPath. You need to create your own histName; of course, if you know exactly which histone sample you are working with, you do not have to set it. The tutorial just needs this variable to simplify its Linux portion. With histName covered, we start with step 2.2 Merge technical replicates.
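Since every per-sample command below is parameterized by histName, one convenient pattern is to loop it over the six sample names from section I (a minimal sketch; the echo stands in for the real per-sample commands):

```shell
# Loop histName over the six datasets used in this tutorial;
# replace echo with the per-sample commands from the steps below.
for histName in K27me3_rep1 K27me3_rep2 K4me3_rep1 K4me3_rep2 IgG_rep1 IgG_rep2; do
    echo "processing ${histName}"
done
```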
```shell
mkdir -p ${projPath}/fastq
# cat merges the 001 & 002 lanes of IgG_rep2 R1 & R2
cat ${projPath}/data/${histName}/*_R1_*.fastq.gz > ${projPath}/fastq/${histName}_R1.fastq.gz
cat ${projPath}/data/${histName}/*_R2_*.fastq.gz > ${projPath}/fastq/${histName}_R2.fastq.gz
# cp copies the data stored under data/ into fastq/
cp ${projPath}/data/${histName}_rep1_R1.fastq.gz ${projPath}/fastq/${histName}_rep1_R1.fastq.gz
```

Then go back up to section 2.1 and copy those commands verbatim:
```shell
##== linux command ==##
# 2.1.1 Obtain FastQC
mkdir -p $projPath/tools
wget -P $projPath/tools https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
cd $projPath/tools
unzip fastqc_v0.11.9.zip
```
```shell
# 2.1.2 Run FastQC for quality check
mkdir -p ${projPath}/fastqFileQC/${histName}
$projPath/tools/FastQC/fastqc -o ${projPath}/fastqFileQC/${histName} -f fastq ${projPath}/fastq/${histName}_R1.fastq.gz
$projPath/tools/FastQC/fastqc -o ${projPath}/fastqFileQC/${histName} -f fastq ${projPath}/fastq/${histName}_R2.fastq.gz
```
III. Alignment
3.1. Bowtie2 alignment
3.1.1
First install bowtie2:

```shell
conda install -c bioconda bowtie2
```

Ignore cores and ref for now; build the four directories first. With projPath set, the commands can be copied as-is:
```shell
mkdir -p ${projPath}/alignment/sam/bowtie2_summary
mkdir -p ${projPath}/alignment/bam
mkdir -p ${projPath}/alignment/bed
mkdir -p ${projPath}/alignment/bedgraph
```

bowtie2-build
“bowtie2-build builds a Bowtie index from a set of DNA sequences. bowtie2-build outputs a set of 6 files with suffixes .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. In the case of a large index these suffixes will have a bt2l termination. These files together constitute the index: they are all that is needed to align reads to that reference. The original sequence FASTA files are no longer used by Bowtie 2 once the index is built.”

Create a bowtie2Index directory to hold the bowtie index:

```shell
mkdir -p ${projPath}/bowtie2Index
```

Download hg38.fa:

```shell
wget -P $projPath https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz
gunzip $projPath/hg38.fa.gz   # bowtie2-build needs the uncompressed FASTA
```

bowtie2-build:

```shell
bowtie2-build ${projPath}/hg38.fa ${projPath}/bowtie2Index/hg38
```

This is fairly slow; when it finishes you will see six files starting with hg38 in the bowtie2Index directory.
Set cores & ref:

```shell
cores=8
ref="/Users/tonyzh/cuttag_project/bowtie2Index/hg38"   # your own path
```

One thing worth explaining: there are six hg38 index files for bowtie2 to read, and bowtie2's -x <bt2-idx> parameter recognizes all of them. We only need to put hg38 (the base name passed to bowtie2-build) after -x for bowtie2 to pick up all six files.

```shell
bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 -p ${cores} -x ${ref} -1 ${projPath}/fastq/${histName}_R1.fastq.gz -2 ${projPath}/fastq/${histName}_R2.fastq.gz -S ${projPath}/alignment/sam/${histName}_bowtie2.sam &> ${projPath}/alignment/sam/bowtie2_summary/${histName}_bowtie2.txt
```
3.1.2
Install samtools:

```shell
conda install -c bioconda samtools
samtools --version
```

Download the E. coli genome:

https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002853715.1/

Download the FASTA; you can save it into cuttag_project.

bowtie2-build the E. coli genome:

```shell
bowtie2-build ${projPath}/ecoli.fa ${projPath}/bowtie2Index/ecoli   # assumes the downloaded FASTA was saved as ecoli.fa
```

Download chromSize:

```shell
wget -P $projPath https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes
chromSize=$projPath/hg38.chrom.sizes   # used by bedtools genomecov in step V
```

Align to the spike-in genome:

```shell
spikeInRef="${projPath}/bowtie2Index/ecoli"
bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 -p ${cores} -x ${spikeInRef} -1 ${projPath}/fastq/${histName}_R1.fastq.gz -2 ${projPath}/fastq/${histName}_R2.fastq.gz -S $projPath/alignment/sam/${histName}_bowtie2_spikeIn.sam &> $projPath/alignment/sam/bowtie2_summary/${histName}_bowtie2_spikeIn.txt
```
seqDepth

The seqDepth of every dataset must be recorded for later use in step V, so it is best to run the previous step on all datasets first, then run this one:

```shell
seqDepthDouble=`samtools view -F 0x04 $projPath/alignment/sam/${histName}_bowtie2_spikeIn.sam | wc -l`
seqDepth=$((seqDepthDouble/2))
echo $seqDepth >$projPath/alignment/sam/bowtie2_summary/${histName}_bowtie2_spikeIn.seqDepth
```
3.2 Report sequencing mapping summary
I will not paste the R commands here; there is nothing special about them, just copy them directly. The resulting table and plots, however, deserve some explanation.
3.2.1
This step reads ${histName}_bowtie2.txt; open any of these .txt files and follow along.

"SequencingDepth": this column extracts the total read count, i.e. the first line of the txt file. Strictly speaking the column should not be called Sequencing Depth, since it is not the sequencing depth but the total number of raw reads; "Total number of reads" / "Total sequencing reads" would be more accurate.

Update: the point above still stands, since the name can mislead. However, a CUT&Tag workflow needs far fewer sequencing reads than techniques like ChIP-seq, which makes using the total read count as the sequencing-depth metric economically practical.

"MappedFragNum": the total number of fragments successfully mapped to the genome; the sum of lines four and five of the file, i.e. fragments aligned exactly once plus fragments aligned more than once.

"AlignmentRate": reads the 6th element of the file, which is the alignment rate.
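As a sanity check, here is a toy parse of those three fields. The file contents below are invented but shaped like bowtie2's paired-end summary, truncated to the lines the report actually reads; the numbers are copied from the K27me3 rep1 row of the table:

```shell
# Toy bowtie2 summary (invented, truncated); numbers from K27me3 rep1
cat > /tmp/toy_bowtie2.txt <<'EOF'
2984630 reads; of these:
  2984630 (100.00%) were paired; of these:
    123402 (4.13%) aligned concordantly 0 times
    2386667 (79.97%) aligned concordantly exactly 1 time
    474561 (15.90%) aligned concordantly >1 times
95.87% overall alignment rate
EOF
seqDepth=$(head -1 /tmp/toy_bowtie2.txt | awk '{print $1}')                  # SequencingDepth: line 1
mapped=$(awk 'NR==4{a=$1} NR==5{b=$1} END{print a+b}' /tmp/toy_bowtie2.txt) # MappedFragNum: line 4 + line 5
rate=$(tail -1 /tmp/toy_bowtie2.txt | awk '{print $1}')                     # AlignmentRate: last line
echo "$seqDepth $mapped $rate"   # 2984630 2861228 95.87%
```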
| Histone | Replicate | SequencingDepth | MappedFragNum_hg38 | AlignmentRate_hg38 |
|---|---|---|---|---|
| K27me3 | rep1 | 2984630 | 2861228 | 95.87% |
| K27me3 | rep2 | 2702260 | 2608071 | 96.51% |
| K4me3 | rep1 | 1581710 | 1494846 | 94.51% |
| K4me3 | rep2 | 1885056 | 1743137 | 92.47% |
| IgG | rep1 | 2127635 | 1749579 | 82.23% |
| IgG | rep2 | 3361571 | 3071723 | 91.38% |

**Update 9/26:** our data names may not be as standard as the tutorial's K27me3_rep1 and K4me3_rep2; here is one workaround:

```r
sampleList = c("", "", "", ...)   # keep sampleList as-is; it is needed to read the data
# delete the original histList = c("", "", "", ...)
for (hist in sampleList) { ... }  # unchanged
histInfo = strsplit(hist, "_")[[1]]  # this is the important line
# strsplit splits hist at "_"; histInfo[1] is then the part before the first "_"
# if your data names are not separated by "_", replace it with your separator,
# e.g. for "IgG-Rep1": strsplit(hist, "-")[[1]]
# once the names are split correctly, proceed as before
```

3.2.2
This step reads ${histName}_bowtie2_spikeIn.txt and adds two columns to the previous table: the total number of fragments mapped to the spike-in DNA (here E. coli) and the corresponding alignment rate. The logic is the same as for the hg38 mapping in 3.2.1.

| Histone | Replicate | SequencingDepth | MappedFragNum_hg38 | AlignmentRate_hg38 | MappedFragNum_spikeIn | AlignmentRate_spikeIn |
|---|---|---|---|---|---|---|
| K27me3 | rep1 | 2984630 | 2861228 | 95.87% | 2705 | 0.09% |
| K27me3 | rep2 | 2702260 | 2608071 | 96.51% | 2258 | 0.08% |
| K4me3 | rep1 | 1581710 | 1494846 | 94.51% | 2120 | 0.13% |
| K4me3 | rep2 | 1885056 | 1743137 | 92.47% | 4373 | 0.23% |
| IgG | rep1 | 2127635 | 1749579 | 82.23% | 60062 | 2.82% |
| IgG | rep2 | 3361571 | 3071723 | 91.38% | 94398 | 2.81% |

3.2.3

Merges the two tables above.

3.2.4

Draws a box-and-whisker plot plus a scatter plot. The rep1 and rep2 values of each histone determine the extent of each box, while the scatter points take the corresponding values with random jitter applied to avoid overlap.
3.3. Remove Duplicates
Install Picard. The latest Picard errors when reading this data (it cannot recognize the header), so install the lowest version usable for CUT&Tag:

```shell
conda install -c bioconda picard="2.18.29"
picard --version   # confirm it installed correctly
```

Set picardCMD. First locate picard.jar:

```shell
find $(conda info --base) -name "picard.jar"
picardCMD="/path returned above/picard.jar"
```

Continue:
```shell
mkdir -p ${projPath}/alignment/removeDuplicate/picard_summary

# Sort by coordinate
java -jar $picardCMD SortSam I=$projPath/alignment/sam/${histName}_bowtie2.sam O=$projPath/alignment/sam/${histName}_bowtie2.sorted.sam SORT_ORDER=coordinate

# Mark duplicates
java -jar $picardCMD MarkDuplicates I=$projPath/alignment/sam/${histName}_bowtie2.sorted.sam O=$projPath/alignment/removeDuplicate/${histName}_bowtie2.sorted.dupMarked.sam METRICS_FILE=$projPath/alignment/removeDuplicate/picard_summary/${histName}_picard.dupMark.txt

# Remove duplicates
java -jar $picardCMD MarkDuplicates I=$projPath/alignment/sam/${histName}_bowtie2.sorted.sam O=$projPath/alignment/removeDuplicate/${histName}_bowtie2.sorted.rmDup.sam REMOVE_DUPLICATES=true METRICS_FILE=$projPath/alignment/removeDuplicate/picard_summary/${histName}_picard.rmDup.txt
```

R section
Table

DuplicationRate

The duplication rate: duplicated fragments / total sequenced fragments * 100%. In _picard.dupMarked.txt, the 'PERCENT_DUPLICATION' value multiplied by 100 gives the duplication rate.

EstimatedLibrarySize

The estimated library size: the estimated total number of unique fragments contained in the library, found as 'ESTIMATED_LIBRARY_SIZE' in _picard.dupMarked.txt.

UniqueFragNum

The number of unique fragments: fragments not marked as duplicates, i.e. the useful ones. UniqueFragNum = MappedFragNum x (1 - DuplicationRate/100), where MappedFragNum is the mapped fragment count obtained earlier from bowtie2.

Plot

The same box-and-whisker plus scatter plot as before, with the rep1 and rep2 values of each histone serving as the upper and lower bounds.
3.4 Assess mapped fragment size distribution
- linux command

```shell
mkdir -p $projPath/alignment/sam/fragmentLen   # output directory (not created earlier)
# Extract the 9th column from the alignment sam file, which is the fragment length (hence abs($9))
samtools view -F 0x04 $projPath/alignment/sam/${histName}_bowtie2.sam | awk -F'\t' 'function abs(x){return ((x < 0.0) ? -x : x)} {print abs($9)}' | sort | uniq -c | awk -v OFS="\t" '{print $2, $1/2}' >$projPath/alignment/sam/fragmentLen/${histName}_fragmentLen.txt
```

- R command
Reads _fragmentLen.txt.

The violin plot shows the shape, width, and density of each histone sample's fragment-length distribution, making it easy to compare variability and central tendency across samples. Wider regions indicate fragment lengths that occur more frequently; the height shows the range of fragment lengths.

The line plot shows the frequency of each fragment length directly, making the most common fragment lengths in each sample easy to read off.
IV. Alignment results filtering and file format conversion
4.1 Filtering mapped reads by the mapping quality filtering
- linux command
```shell
minQualityScore=2   # this line can be skipped; if so, replace $minQualityScore below with 2
samtools view -q $minQualityScore ${projPath}/alignment/sam/${histName}_bowtie2.sam >${projPath}/alignment/sam/${histName}_bowtie2.qualityScore$minQualityScore.sam
```
4.2 File format conversion
- Install bedtools

```shell
conda install -c bioconda bedtools
```
- linux command
```shell
## Filter and keep the mapped read pairs
samtools view -bS -F 0x04 $projPath/alignment/sam/${histName}_bowtie2.sam >$projPath/alignment/bam/${histName}_bowtie2.mapped.bam
## Convert into bed file format
bedtools bamtobed -i $projPath/alignment/bam/${histName}_bowtie2.mapped.bam -bedpe >$projPath/alignment/bed/${histName}_bowtie2.bed
## Keep the read pairs that are on the same chromosome and fragment length less than 1000bp.
awk '$1==$4 && $6-$2 < 1000 {print $0}' $projPath/alignment/bed/${histName}_bowtie2.bed >$projPath/alignment/bed/${histName}_bowtie2.clean.bed
## Only extract the fragment related columns
cut -f 1,2,6 $projPath/alignment/bed/${histName}_bowtie2.clean.bed | sort -k1,1 -k2,2n -k3,3n >$projPath/alignment/bed/${histName}_bowtie2.fragments.bed
```
4.3 Assess replicate reproducibility
- linux command
```shell
# We use the midpoint of each fragment to infer which 500bp bin the fragment belongs to.
binLen=500
awk -v w=$binLen '{print $1, int(($2 + $3)/(2*w))*w + w/2}' $projPath/alignment/bed/${histName}_bowtie2.fragments.bed | sort -k1,1V -k2,2n | uniq -c | awk -v OFS="\t" '{print $2, $3, $1}' | sort -k1,1V -k2,2n >$projPath/alignment/bed/${histName}_bowtie2.fragmentsCount.bin$binLen.bed
```

- R command
Reads _bowtie2.fragmentsCount.bin500.bed.

Correlation-matrix heatmap:
- The color intensity encodes the size of the correlation coefficient: deeper red means higher correlation, closer to 1; lighter color means lower correlation, closer to 0 or negative.
- High correlation between two samples means their fragment distributions are very similar; low correlation means they differ substantially.
- The black boxes highlight sample clusters; in this plot the two replicates of each histone cluster together.
V. Spike-in calibration
Building on the spike-in alignment from step 3, this step normalizes the signal to remove the effect of sequencing-depth differences between samples.

- linux command

This is where the seqDepth recorded earlier comes in; remember to replace $seqDepth with the value you saved. If you forgot to record it, rerun the seqDepth part of 3.1.2.
2
3
4
5
6
7
8
9
10if [[ "$seqDepth" -gt "1" ]]; then
mkdir -p $projPath/alignment/bedgraph
scale_factor=`echo "10000 / $seqDepth" | bc -l`
echo "Scaling factor for $histName is: $scale_factor!"
bedtools genomecov -bg -scale $scale_factor -i $projPath/alignment/bed/${histName}_bowtie2.fragments.bed -g $chromSize > $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph
fi
5.1 Scaling factor
- R
scaleFactor

There is a naming/definition issue here: as mentioned earlier, Sequencing Depth in this tutorial actually refers to total reads. In CUT&Tag data processing, though, the term Sequencing Depth is sometimes used loosely in exactly this way, to mean the sample's total read count. This simplifies the concept somewhat, but it works for most processing and analysis scenarios, since the main goal is to compare signal strength across conditions by normalizing the read counts.

Back to scaleFactor. The scale factor is used to normalize samples from different experimental conditions. It is computed from the proportion of spike-in DNA (usually E. coli DNA), with the aim of correcting for differences between samples caused by experimental conditions or sequencing depth, so that comparisons between samples are accurate (i.e. unaffected by depth differences).

scaleFactor = 10000 / seqDepth (i.e. MappedFragNum_spikeIn)
Table
| Histone | Replicate | SequencingDepth | MappedFragNum_hg38 | AlignmentRate_hg38 | MappedFragNum_spikeIn | AlignmentRate_spikeIn | DuplicationRate | EstimatedLibrarySize | UniqueFragNum | scaleFactor |
|---|---|---|---|---|---|---|---|---|---|---|
| K27me3 | rep1 | 3568894 | 2861228 | 95.87% | 2705 | 0.09% | 4.7577% | 29108084 | 2725099.4 | 3.6968577 |
| K27me3 | rep2 | 3208657 | 2608071 | 96.51% | 2258 | 0.08% | 1.0259% | 126236715 | 2581314.8 | 4.4286980 |
| K4me3 | rep1 | 1755438 | 1494846 | 94.51% | 2120 | 0.13% | 6.458% | 11069778 | 1398308.8 | 4.7169811 |
| K4me3 | rep2 | 2124574 | 1743137 | 92.47% | 4373 | 0.23% | 2.6512% | 32290874 | 1696923.0 | 2.2867597 |
| IgG | rep1 | 2621738 | 1749579 | 82.23% | 60062 | 2.82% | 81.0974% | 332438 | 330715.9 | 0.1664946 |
| IgG | rep2 | 4181972 | 3071723 | 91.38% | 94398 | 2.81% | 36.773% | 3073459 | 1942158.3 | 0.1059344 |
VI. Peak calling
6.1. SEACR
The main goal of this step is to identify the chromatin regions bound by a particular modification or protein; these signal-enriched regions are usually called "peaks".

- Install SEACR

```shell
conda install -c bioconda seacr
```
- Install chromVAR

```r
# in a new R script
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("chromVAR")
# then, in the console
library(chromVAR)
```

- Locate SEACR_x.x.sh

It is usually at ~/miniconda3/lib/SEACR_x.x.sh. Once you find it, run pwd there and assign the returned path plus /SEACR_x.x.sh to seacr:
```shell
seacr="/path/SEACR_1.3.sh"
histControl=$2   # the IgG control sample name; $2 assumes a script argument, set it by hand if running interactively
mkdir -p $projPath/peakCalling/SEACR
bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph $projPath/alignment/bedgraph/${histControl}_bowtie2.fragments.normalized.bedgraph non stringent $projPath/peakCalling/SEACR/${histName}_seacr_control.peaks
bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph 0.01 non stringent $projPath/peakCalling/SEACR/${histName}_seacr_top0.01.peaks
```
6.1.1 Number of peaks called
- R
The figure in 6.1.2 covers this step as well, so both are discussed together below.
6.1.2 Reproducibility of the peak across biological replicates
R
peakType

Distinguishes the types of peaks called under different conditions. control: peaks called against the IgG control sample. IgG helps us filter out background noise: since IgG does not bind any specific target protein or modification, the IgG sample represents non-specific noise. Calling control peaks helps identify which regions are background rather than genuine enrichment.

top0.01: the top 1% of peaks by signal strength, which are typically high-confidence regions. Ranking by signal strength and taking the strongest 1% focuses the analysis on the most biologically meaningful regions, whose signal sits far above background and usually reflects real chromatin modification or protein binding. The point of distinguishing peakType is to better separate true biological signal from technical noise, ensuring reliable data and fewer spurious results caused by non-specific noise.
peakN

```r
peakN = data.frame(peakN = nrow(peakInfo), ...)
```

SEACR compares the target sample (e.g. K27me3) with the control sample (IgG) and produces a .bed file. Each row of this file is one peak, with its start position, end position, and other statistics. peakN uses nrow() to count the rows of each .bed file, i.e. the total number of peaks from each comparison, giving the numbers in the table below.

peakReprodNum

The number of reproduced peaks: counted by comparing peak start and end positions between the .bed files and tallying the peaks that overlap.

peakReprodRate

```r
# peakReprod = peakReprodNum
peakReprodRate = peakReprod/peakN * 100
```

The proportion of reproduced peaks: overlapping peaks as a fraction of all peaks.
| Histone | Replicate | peakType | peakN | peakReprodNum | peakReprodRate |
|---|---|---|---|---|---|
| K27me3 | rep1 | control | 13006 | 31924 | 245.45594 |
| K27me3 | rep1 | top0.01 | 6004 | 5384 | 89.67355 |
| K27me3 | rep2 | control | 251283 | 31924 | 12.70440 |
| K27me3 | rep2 | top0.01 | 9801 | 5384 | 54.93317 |
| K4me3 | rep1 | control | 3692 | 3684 | 99.78332 |
| K4me3 | rep1 | top0.01 | 903 | 898 | 99.44629 |
| K4me3 | rep2 | control | 7049 | 3684 | 52.26273 |
| K4me3 | rep2 | top0.01 | 3765 | 898 | 23.85126 |
6.1.3 FRagment proportion in Peaks regions
R
FragInPeakNum

The number of fragments inside peak regions: obtained by combining the fragment positions in the _bowtie2.mapped.bam file with the peak positions in the .peaks.stringent.bed file.

FRiPs

The fraction of fragments inside peak regions, used to assess data quality:

FRiPs = FragInPeakNum / MappedFragNum_hg38 * 100%
| Histone | Replicate | SequencingDepth | MappedFragNum_hg38 | AlignmentRate_hg38 | FragInPeakNum | FRiPs |
|---|---|---|---|---|---|---|
| K27me3 | rep1 | 2984630 | 2861228 | 95.87 | 1200390 | 41.95366 |
| K27me3 | rep2 | 2702260 | 2608071 | 96.51 | 1635121 | 62.69465 |
| K4me3 | rep1 | 1581710 | 1494846 | 94.51 | 1032228 | 69.05246 |
| K4me3 | rep2 | 1885056 | 1743137 | 92.47 | 1166007 | 66.89130 |
6.1.4 Visualization of peak number, peak width, peak reproducibility and FRiPs
- 7A Peak number: compares the number of peaks across histones and replicates.
- 7B Peak width: shows the distribution of peak widths.
- 7C Peak reproducibility: measures the proportion of reproduced peaks.
- 7D FRiPs (fraction of fragments in peaks): measures how much of the fragment coverage falls inside peaks.
VII. Visualization
7.1. Browser display of normalized bedgraph files.
7.2. Heatmap visualization on specific regions
Install deeptools:

```shell
conda install -c bioconda deeptools
# if conda cannot install it:
pip install deeptools
```

Next, patch heatmapper.py inside deepTools.
Change np.NAN on line 379 to np.nan (newer NumPy versions removed the np.NAN alias):

```shell
find $(conda info --base) -name "heatmapper.py"   # prints the path to the file
nano /path/heatmapper.py
# go to line 379 and make the change
# then save: Ctrl+X, Y, Enter
```
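If you would rather not edit the file interactively, the same fix can be scripted with sed. The snippet below demonstrates it on a throwaway file under /tmp; for the real fix, run the same sed on the heatmapper.py path that find returned (GNU sed syntax; on macOS write `sed -i ''` instead of `sed -i`):

```shell
# Demonstrate the np.NAN -> np.nan replacement on a throwaway file
printf 'imgmat[:] = np.NAN\n' > /tmp/toy_heatmapper.py
sed -i 's/np\.NAN/np.nan/g' /tmp/toy_heatmapper.py
cat /tmp/toy_heatmapper.py   # imgmat[:] = np.nan
```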
linux command

```shell
mkdir -p $projPath/alignment/bigwig
samtools sort -o $projPath/alignment/bam/${histName}.sorted.bam $projPath/alignment/bam/${histName}_bowtie2.mapped.bam
samtools index $projPath/alignment/bam/${histName}.sorted.bam
bamCoverage -b $projPath/alignment/bam/${histName}.sorted.bam -o $projPath/alignment/bigwig/${histName}_raw.bw
```

hg38.bed
https://support.illumina.com.cn/downloads/enrichment-bed-files-hg38.html
Used below, when generating the heatmaps, to align against hg38. Download the first file, Illumina Exome Panel v1.2.

7.2.1 Heatmap over transcription units
```shell
cores=8
computeMatrix scale-regions -S $projPath/alignment/bigwig/K27me3_rep1_raw.bw $projPath/alignment/bigwig/K27me3_rep2_raw.bw $projPath/alignment/bigwig/K4me3_rep1_raw.bw $projPath/alignment/bigwig/K4me3_rep2_raw.bw -R $projPath/data/hg38_gene/hg38_gene.tsv --beforeRegionStartLength 3000 --regionBodyLength 5000 --afterRegionStartLength 3000 --skipZeros -o $projPath/data/hg38_gene/matrix_gene.mat.gz -p $cores
plotHeatmap -m $projPath/data/hg38_gene/matrix_gene.mat.gz -out $projPath/data/hg38_gene/Histone_gene.png --sortUsing sum
```

7.2.2 Heatmap on CUT&Tag peaks
```shell
awk '{split($6, summit, ":"); split(summit[2], region, "-"); print summit[1]"\t"region[1]"\t"region[2]}' $projPath/peakCalling/SEACR/${histName}_${repName}_seacr_control.peaks.stringent.bed >$projPath/peakCalling/SEACR/${histName}_${repName}_seacr_control.peaks.summitRegion.bed
computeMatrix reference-point -S $projPath/alignment/bigwig/${histName}_${repName}_raw.bw -R $projPath/peakCalling/SEACR/${histName}_${repName}_seacr_control.peaks.summitRegion.bed --skipZeros -o $projPath/peakCalling/SEACR/${histName}_${repName}_SEACR.mat.gz -p $cores -a 3000 -b 3000 --referencePoint center
plotHeatmap -m $projPath/peakCalling/SEACR/${histName}_${repName}_SEACR.mat.gz -out $projPath/peakCalling/SEACR/${histName}_${repName}_SEACR_heatmap.png --sortUsing sum --startLabel "Peak Start" --endLabel "Peak End" --xAxisLabel "" --regionsLabel "Peaks" --samplesLabel "${histName} ${repName}"
```