HPC Lab

BMI - OSU





HPC Lab

Benchmarking Short Sequence Mapping Tools

General information

The development of next-generation sequencing instruments has led to the generation of millions of short sequences in a single run. The process of aligning these reads to a reference genome is time consuming and demands the development of fast and accurate alignment tools. However, the current proposed tools make different compromises between the accuracy and the speed of mapping. Moreover, many important aspects are overlooked while comparing the performance of a newly developed tool to the state of the art. Therefore, there is a need for an objective evaluation method that covers all the aspects. In this work, we introduce a benchmarking suite to extensively analyze sequencing tools with respect to various aspects and provide an objective comparison. Information about the tools and the options we used in the experiments are shown in the following. In addition, the code used to verify the tools is included.

Related Publications

Experimental setup

The experiments in this study can be repeated following three major steps: getting the reference genomes, generating the synthetic data sets, and choosing the right options for the tools. Each one of these are described in detail below. If needed, instead of regenerating datasets, you can also download them from here.

  1. Getting the reference genomes:
  2. The reference genomes are available online from different websites. For our experiments, we downloaded the genomes from UCSC Genome Bioinformatics Center (http://genome.ucsc.edu/). There are also other available reference genomes.

  3. Generating the synthetic data:
  4. To generate the synthetic data, we used wgsim, which is part of SAMtools package ( http://samtools.sourceforge.net/ ). You can use different options for wgsim to emulate the base error rate, the mutation rate, and percentage of indels beside other options. For our experiments, we used the following options with the following values:
    -e 0.02 (default value), -r 0.0009, -R 0.0001
    In addition, based on every experiments, we changed -N (total number of reads) and -1 and -2 (length for the first and second read).
    In addition to wgsim, we used ART ( http://www.niehs.nih.gov/research/resources/software/biostatistics/art/ ) to generate reads with a varying sequencing error rate. The default options were used.

  5. Choosing the right options for the tools:
  6. It is important to disable and enable the right options beside choosing the right values for them. In the following, we show how the different options are used to run a fair comparison. It is important to note that in all of the experiments we used pMap on a single node to provide the execution time for some of the tools; not all of the tools provide the execution time. Therefore, we first explain what pMap is and how to use it. Then, we mention the scripts we used to run pMap with the different tools.

    pMap:
    pMap ( http://bmi.osu.edu/hpc/software/pmap/pmap.html) is an open-source implementation of MPI-based tool that enable parallelization of existing short sequence mapping tools. Currently, it supports the following tools: Bowtie, BWA, SOAP, GSNAP, MAQ, and RMAP. In addition, it can be extended easily to integrate other tools.
    The followings are the main commands to use pMap:
    pmap_index $genomefile $indexdir $indexprefix $programname
    pmap_dist $workdir $outdir $readsfile [-r $readfile2]
    pmap [-pe](paired end) -i $indexdir $indexprefix $workdir $outdir $programname $options

    1. Bowtie
    2. First of all, before calling the mapping program, bowtie needs the bowtie_indexes environment variable to contain the location of the reference genome index.
      BOWTIE_INDEXES=/home/dayat/out-bowtie/index/lancelet; export BOWTIE_INDEXES
      The experiments specific options are as follows:
      • Quality threshold: -e 140 -n 2 -l 28 -S
      • Number of mismatches: -n 2 -l 28 -S -e (40, 60, 80, 100, 120, 140)
      • Seed length: -n 2 -l (20, 24, 28, 32, 36) -e 100 -S
      • Read length: -n 2 -l 28 -e 100 -S
      • Paired end: -n 2 -l 28 -e 100 -S -I 0 -X 500
      • Genome type: -n 2 -l 28 -e 100 -S
      • Performance: -n 2 -l 28 -e 100 -S -p (2, 4, 8)

    3. Bowtie2
    4. Bowtie2 is not supported by pMap. To run Bowtie2, the following command is used:
      Bowtie2 -t --ignore-quals $indexdir/indexprefix -U $readfile -S $workdir/out.txt
      The experiments specific options are as follow:
      • Quality threshold: --score-min L,-21,0 --mp 3,3 --gbar 125
      • Number of mismatches: --score-min L,-(6, 9, 12, 15, 18, 21),0 --mp 3,3 --gbar 125
      • Read length: --score-min L,-15,0 --mp 3,3 --gbar (36, 70, 125)
      • Paired end: --score-min L,-15,0 --mp 3,3 (for ungapped --gbar 70) --no-discordant --no-mixed
      • Genome type: - --score-min L,-15,0 --mp 3,3 --gbar 125
      • Gaped alignment: --score-min L,-15,0 --mp 3,3
      • Performance: -p 2 --score-min L,-15,0 --mp 3,3 --gbar 125

    5. BWA
    6. Experiments specific options:
      • Quality threshold: -n 5 -l 28 -k 2 -o 0
      • Number of mismatches: -l 28 -k 2 -o 0 -n (2, 3, 4, 5, 6, 7)
      • Seed length: -n 5 -l (20, 24, 28, 32, 36) -k 2 -o 0
      • Read length: -n 5 -l 28 -k 2 -o 0
      • Paired end: -n 5 -l 28 -k 2 -o 0 (or -o 1 -e 3 for gaped alignment) -conversion -a 500 -s(disable Smith-Waterman alignment)
      • Genome type: -n 5 -l 28 -k 2 -o 0
      • Gaped alignment: -n 5 -l 28 -k 2 -o 1 -e 3
      • Performance: I-n 5 -l 28 -k 2 . o 0 -t(2, 4, 8)

    7. SOAP2
    8. Experiments specific options:
      • Quality threshold: -l 28 -v 7
      • Number of mismatches: -l 28 -v (2, 3, 4, 5, 6, 7)
      • Seed length: -l (28, 32, 36) -v 5
      • Read length: -l 28 -v 5
      • Paired end: -l 28 -v 5 -m 0 -x 500 -2 outfile:unpaired.txt
      • Genome type: -l 28 -v 5
      • Gaped alignment (only for paired end): -g 3
      • Performance: -l 28 -v 5 -p (2, 5, 8)

    9. GSNAP
    10. Experiments specific options:
      • Quality threshold: -m 7 -n 1 -w 0 -T 0 -A sam
      • Number of mismatches: -m (2, 3, 4, 5, 6, 7) -n 1 -w 0 -T 0 -A sam
      • Seed length: does not support seeding
      • Read length: -m 5 -n 1 -w 0 -T 0 -A sam
      • Paired end: -m 5 -n 1 -w 0 -T 0 -A sam ( -i 0 -y 3 -Y 3 -z 3 -Z 3 for gaped alignment)
      • Genome type: -m 5 -n 1 -w 0 -T 0 -A sam
      • Gaped alignment: -m 5 -n 1 -w 0 -T 0 -A sam -i 0 -y 3 -Y 3 -z 3 -Z 3
      • Performance: -m 5 -n 1 -w 0 -T 0 -A sam -t (2, 4, 8)

    11. Novoalign
    12. Novoalign is not yet supported by pMap. Therefore, to run Novoalign, call the following command:
      novoalign -f $readsfile -o SAM -d $indexdir/$indexprefix > $workdir/out.txt
      Experiments specific options:
      • Quality threshold: -t 154 -o SAM -o FullNW -g 99 -x 99 -r Random
      • Number of mismatches: -t (44, 66, 88, 110, 132, 154) -o SAM -o FullNW -g 99 -x 99 -r Random
      • Read length: -t 110 -o SAM -o FullNW -g 99 -x 99
      • Paired end: -t 110 -o SAM -o FullNW (for ungapped alignment -g 99 -x 99) -i 500 50
      • Genome type: -t 110 -o SAM -o FullNW -g 99 -x 99
      • Gaped alignment: -t 110 -o SAM -o FullNW

    13. MAQ
    14. Experiments specific options:
      • Quality threshold: -e 140
      • Number of mismatches: -e (40, 60, 80, 100, 120, 140)
      • Read length: -e 100
      • Read length: -e 100
      • Genome type: -e 100

    15. RMAP
    16. Experiments specific options:
      • Quality threshold: -m 7 -w 125 -a outfile:amb.txt
      • Number of mismatches: -m (2, 3, 4, 5, 6, 7) -w 125
      • Read length: -m 5 -w (36, 70, 125, 200, 300)
      • Paired end: -m 5 -w 70 -min-sep 0 -max-sep 500 -a outfile:amb.txt
      • Genome type: -m 5 -w 70 -a outfile:amb.txt

    17. FANGS
    18. FANGS is not yet supported by pMap. Therefore, to run FANGS, call the following command:
      fangs $indexdir/$indexprefix $readsfile -out=pslx $workdir/out.txt
      FANGS is used only in one experiment (Read length) and there is no specific options needed to use for this experiment. In addition, by default, it allows five mismatches.

Download

Latest release: verifTools.tar

Genomes

Human genome: human.fa.gz (905MB)

Chimp genome: chimp.fa.gz (795MB)

Mouse genome: mouse.fa.gz (801MB)

Zebrafish genome: zebrafish.fa.gz (395MB)

Lancelet genome: lancelet.fa.gz (265MB)

Mellifera genome: mellifera.fa.gz (72MB)

C.Elegans genome: elegans.fa.gz (31MB)

wgsim synthetic data

Human data

Read length 36, paired end: human.36.1M.1.fq.gz (26MB), human.36.1M.2.fq.gz (26MB)

Read length 70, paired end: human.70.1M.1.fq.gz (37MB), human.70.1M.2.fq.gz (37MB)

Read length 125, paired end: human.125.1M.1.fq.gz (53MB), human.125.1M.2.fq.gz (53MB)

Read length 200, paired end: human.200.1M.1.fq.gz (75MB), human.200.1M.2.fq.gz (75MB)

Read length 300, paired end: human.300.1M.1.fq.gz (105MB), human.300.1M.2.fq.gz (105MB)

Read length 500, paired end: human.500.1M.1.fq.gz (161MB), human.500.1M.2.fq.gz (161MB)

Chimp data

Read length 70, paired end: chimp.1M.1.fq.gz (36MB), chimp.1M.2.fq.gz (36MB)

Mouse data

Read length 70, paired end: mouse.1M.1.fq.gz (37MB), mouse.1M.2.fq.gz (37MB)

Zebrafish data

Read length 70, paired end: zebrafish.1M.1.fq.gz (38MB), zebrafish.1M.2.fq.gz (38MB)
Lancelet data

Read length 70, paired end: lancelet.1M.1.fq.gz (37MB), lancelet.1M.2.fq.gz (37MB)

Mellifera data

Read length mellifera, paired end: mellifera.1M.1.fq.gz (24MB), mellifera.1M.2.fq.gz (24MB)

C.Elegans data

Read length 70, paired end: cElegans.1M.1.fq.gz (38MB), cElegans.1M.2.fq.gz (38MB)

ART synthetic data

Human, read length 70: human.70.1M.fq.gz (74MB)

Human, read length 100: human.100.1M.fq.gz (97MB)

Chimp, read length 70: chimp.1M.fq.gz (74MB)

Mouse, read length 36: mouse.36.1M.fq.gz (43MB)

Mouse, read length 70: mouse.70.1M.fq.gz (77MB)

Mouse, read length 100: mouse.100.1M.fq.gz (100MB)

Zebrafish, read length 70: zebrafish.1M.fq.gz (75MB)

Lancelet, read length 70: lancelet.1M.fq.gz (74MB)

Mellifera, read length 70: mellifera.1M.fq.gz (64MB)

C.Elegans, read length 70: cElegans.1M.fq.gz (75MB)