Genome Alignment with Bowtie 2

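These are notes on the tools I use for ChIP-seq / CUT&Tag analysis. This post only gives a brief introduction to how each tool is used.

Bowtie 2 is a widely used tool for aligning sequencing reads to a reference genome. How it works is not covered here; interested readers can consult the official documentation and the original publication (https://doi.org/10.1038/nmeth.1923). Below is a brief introduction to the Bowtie 2 indexing and alignment commands, along with the parameters I commonly use.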

Usage

Index

bowtie2-build [options]* <reference_in> <bt2_base>

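<reference_in>: with the -f option, this is the reference FASTA file (or files) to index; with the -c option, the reference sequences are given directly on the command line as a comma-separated list, for example GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA.
<bt2_base>: the prefix of the index files to be written. By default, bowtie2-build produces NAME.1.bt2, NAME.2.bt2, NAME.3.bt2, NAME.4.bt2, NAME.rev.1.bt2, and NAME.rev.2.bt2, where NAME is <bt2_base>.
--threads: the number of threads to use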

Index example

bowtie2-build -f /public/Reference/GRCh38.primary_assembly.genome.fa --threads 24 GRCh38

The command above builds an index from the FASTA file /public/Reference/GRCh38.primary_assembly.genome.fa and writes index files with the prefix GRCh38 to the current directory.
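As a quick sanity check after indexing, the sketch below lists the six file names that bowtie2-build is expected to write for a given prefix and reports any that are missing. The function names are my own (hypothetical, not part of Bowtie 2), and the sketch assumes a small index with the .bt2 extension; very large references get a "large" index with the .bt2l extension instead, so adjust the extension in that case.

```shell
# Hypothetical helpers (not part of Bowtie 2) for checking an index prefix.
# Assumption: a small index, i.e. files ending in .bt2 rather than .bt2l.

# Print the six file names expected for the given index prefix.
expected_index_files() {
    for suffix in 1 2 3 4 rev.1 rev.2; do
        printf '%s.%s.bt2\n' "$1" "$suffix"
    done
}

# Report every expected index file that does not exist.
# Returns 0 if all six files are present, 1 otherwise.
missing_index_files() {
    status=0
    for suffix in 1 2 3 4 rev.1 rev.2; do
        f="$1.$suffix.bt2"
        if [ ! -e "$f" ]; then
            printf 'missing: %s\n' "$f"
            status=1
        fi
    done
    return "$status"
}

# Usage (after running bowtie2-build with the prefix GRCh38):
#   missing_index_files GRCh38 && echo "index looks complete"
```

Checking the prefix before starting a long alignment run avoids finding out about a mistyped -x path only after the job fails.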

Alignment

Single-end alignment

bowtie2 [options]* -x <bt2-idx> -U <fq> -S <sam_output> -p <threads> 2>Align.summary

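-x: the prefix of the reference genome index files (including the path)
-U: the FASTQ file of single-end reads
-S: the output SAM file containing the alignment results
-p: the number of threads to use
2>Align.summary: redirects the standard error stream (stderr), which would otherwise be printed to the screen, to the file "Align.summary". Its format typically looks like this: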

## Single-end
20000 reads; of these:
  20000 (100.00%) were unpaired; of these:
    1247 (6.24%) aligned 0 times
    18739 (93.69%) aligned exactly 1 time
    14 (0.07%) aligned >1 times
93.77% overall alignment rate

## Paired-end
10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
    ----
    650 pairs aligned concordantly 0 times; of these:
      34 (5.23%) aligned discordantly 1 time
    ----
    616 pairs aligned 0 times concordantly or discordantly; of these:
      1232 mates make up the pairs; of these:
        660 (53.57%) aligned 0 times
        571 (46.35%) aligned exactly 1 time
        1 (0.08%) aligned >1 times
96.70% overall alignment rate

The indentation indicates how subtotals relate to totals.
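Because the summary is plain text, the headline numbers are easy to pull out in a script. The sketch below reads a file containing one Bowtie 2 summary (such as the Align.summary written by the command above) and prints the total number of reads and the overall alignment rate. The function names are hypothetical, and the sketch assumes the summary wording shown above.

```shell
# Hypothetical helpers (not part of Bowtie 2) for reading a summary file
# captured from stderr. Assumes the file holds a single summary.

# Print the total number of reads, taken from the first
# "<N> reads; of these:" line.
total_reads() {
    awk '/reads; of these:/ { print $1; exit }' "$1"
}

# Print the overall alignment rate (for example "93.77%"), taken from the
# last line containing "overall alignment rate".
overall_alignment_rate() {
    awk '/overall alignment rate/ { rate = $1 }
         END { if (rate != "") print rate }' "$1"
}

# Usage:
#   total_reads Align.summary
#   overall_alignment_rate Align.summary
```

When aligning many samples, looping these helpers over each sample's summary file gives a quick table for spotting libraries with unusually low alignment rates.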

Paired-end alignment

bowtie2 [options]* -x <bt2-idx> -1 <fq1> -2 <fq2> -S <sam_output> -p <threads> 2>Align.summary

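Paired-end mode is essentially the same as single-end mode; only the options used to pass in the FASTQ files change:
-1: the FASTQ file containing mate 1 reads
-2: the FASTQ file containing mate 2 reads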
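To make the difference between the two modes concrete, here is a small dry-run sketch that prints (but does not run) the bowtie2 command for either case, depending on whether one or two FASTQ files are given. The function name and the file names in the usage lines are hypothetical; only the options described above are used.

```shell
# Hypothetical helper: print (do not run) a bowtie2 command line.
# Arguments: <bt2-idx> <sam_output> <threads> <fq> [<fq2>]
# One FASTQ file selects -U (single-end); two select -1/-2 (paired-end).
bowtie2_cmd() {
    if [ "$#" -eq 4 ]; then
        printf 'bowtie2 -x %s -U %s -S %s -p %s\n' "$1" "$4" "$2" "$3"
    elif [ "$#" -eq 5 ]; then
        printf 'bowtie2 -x %s -1 %s -2 %s -S %s -p %s\n' \
            "$1" "$4" "$5" "$2" "$3"
    else
        echo 'usage: bowtie2_cmd <bt2-idx> <sam_output> <threads> <fq> [<fq2>]' >&2
        return 2
    fi
}

# Usage (file names are placeholders):
#   bowtie2_cmd GRCh38 sample.sam 8 sample.fq
#   bowtie2_cmd GRCh38 sample.sam 8 sample_R1.fq sample_R2.fq
```

Printing the command first is a cheap way to check paths and options before launching a long run; append 2>Align.summary when actually running it to capture the summary.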

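Bowtie 2 has many more alignment parameters that can be tuned; they are not covered one by one here. The next section describes the meaning of each column in the SAM file it outputs.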

Alignment OUTPUT

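Alignment results are saved as a SAM file. Apart from the header lines (which start with @), each line of the SAM file describes the alignment of one read and contains at least 12 tab-separated columns. From left to right, the columns are: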

  1. Name of the read

  2. Sum of flags

    In Bowtie 2, the flags have the following meanings:
    1      The read is one of a pair
    2      The alignment is one end of a proper paired-end alignment
    4      The read has no reported alignments
    8      The read is one of a pair and has no reported alignments
    16     The alignment is to the reverse reference strand
    32     The other mate in the paired-end alignment is aligned to the reverse reference strand
    64     The read is mate 1 in a pair
    128    The read is mate 2 in a pair
    Note that aligners can differ somewhat in how they set these flags.

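  3. Name of the reference sequence (chromosome) the read aligned to

  4. 1-based coordinate on the forward strand of the reference where the leftmost base of the alignment falls (this is the leftmost aligned position, not necessarily the 5' end of the read)

  5. Mapping quality (MAPQ)

  6. CIGAR string, which describes the alignment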

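  7. For paired-end data, the name of the reference sequence the mate aligned to; = if it is the same as this alignment's, or * if there is no mate

  8. For paired-end data, the 1-based coordinate on the forward strand of the reference where the leftmost base of the mate's alignment falls; 0 if there is no mate

  9. Inferred fragment length. The value is negative if the mate's alignment lies upstream of this alignment. The value is 0 if the mates did not align concordantly; however, it is non-zero if the mates aligned discordantly to the same chromosome, because a fragment length can still be computed in that case.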

  10. Read sequence

  11. ASCII-encoded base qualities of the read (Phred+33, as in a FASTQ file)

  12. Optional fields, including the following

    AS:i:<N> Alignment score. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if SAM record is for an aligned read. 
    XS:i:<N> Alignment score for the best-scoring alignment found other than the alignment reported. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read. Note that, when the read is part of a concordantly-aligned pair, this score could be greater than AS:i.
    YS:i:<N> Alignment score for opposite mate in the paired-end alignment. Only present if the SAM record is for a read that aligned as part of a paired-end alignment.
    XN:i:<N> The number of ambiguous bases in the reference covering this alignment. Only present if SAM record is for an aligned read.
    XM:i:<N> The number of mismatches in the alignment. Only present if SAM record is for an aligned read.
    XO:i:<N> The number of gap opens, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read.
    XG:i:<N> The number of gap extensions, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read.
    NM:i:<N> The edit distance; that is, the minimal number of one-nucleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string. Only present if SAM record is for an aligned read.
    YF:Z:<S> String indicating reason why the read was filtered out. See also: Filtering. Only appears for reads that were filtered out.
    YT:Z:<S> Value of UU indicates the read was not part of a pair. Value of CP indicates the read was part of a pair and the pair aligned concordantly. Value of DP indicates the read was part of a pair and the pair aligned discordantly. Value of UP indicates the read was part of a pair but the pair failed to align either concordantly or discordantly.
    MD:Z:<S> A string representation of the mismatched reference bases in the alignment.
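Two of the columns above are easier to understand with a small script: the flag sum in column 2 and the optional fields from column 12 onward. The sketch below decodes a flag sum into the individual flags listed above, and pulls the value of one optional field out of a SAM record. Both function names are hypothetical, and the second helper assumes the usual TAG:TYPE:VALUE layout with a two-letter tag.

```shell
# Hypothetical helpers (not part of Bowtie 2) for inspecting SAM records.

# Print one line per flag bit set in a SAM flag sum (bit, tab, meaning).
decode_flag() {
    flag="$1"
    for entry in \
        '1:read is one of a pair' \
        '2:alignment is one end of a proper paired-end alignment' \
        '4:read has no reported alignments' \
        '8:read is one of a pair and has no reported alignments' \
        '16:alignment is to the reverse reference strand' \
        '32:other mate is aligned to the reverse reference strand' \
        '64:read is mate 1 in a pair' \
        '128:read is mate 2 in a pair'
    do
        bit="${entry%%:*}"
        # A bit is set when the bitwise AND with the flag sum is non-zero.
        if [ $(( flag & bit )) -ne 0 ]; then
            printf '%s\t%s\n' "$bit" "${entry#*:}"
        fi
    done
}

# Read SAM records on stdin and print the value of one optional field
# (for example AS or NM) for each record that carries it.
sam_tag() {
    awk -F '\t' -v tag="$1" '
        /^@/ { next }                      # skip header lines
        {
            for (i = 12; i <= NF; i++) {
                if (index($i, tag ":") == 1) {
                    # Skip "TAG:TYPE:" (tag, colon, one-letter type, colon).
                    print substr($i, length(tag) + 4)
                    break
                }
            }
        }'
}

# Usage:
#   decode_flag 99                 # 99 = 1 + 2 + 32 + 64
#   sam_tag AS < output.sam | head
```

For example, a flag sum of 99 breaks down into 1, 2, 32 and 64: a read that is mate 1 of a properly aligned pair whose mate is on the reverse strand.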

That concludes these notes on genome alignment with Bowtie 2; I will add to them as I learn more.

Ref:
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#how-is-bowtie-2-different-from-bowtie-1

End.