HPC Lab

BMI - OSU





HPC Lab

pMap: Parallel Sequence Mapping Tool

General information

pMap is an MPI-based tool to parallelize the alignment step of state-of-the-art sequence mapping programs. It allows transparent execution of the alignment step of a selected program in parallel on a compute-cluster. The results are the same as those from a sequential run of the selected mapping program, except for differences due to random choices within the mapping program. The parallel execution using pMap is approximately NxP times faster, where N is the number of nodes and P is the number of cores/threads per node. Currently, pMap supports BWA, SOAP, Bowtie, GSNAP, MAQ and RMAP.

pMap divides the workload by assigning approximately R/N reads to each node, where R is the number of reads provided in a formatted input file (e.g. FASTA or FASTQ). Moreover, it replicates the pre-built index file(s) associated with the reference genome on each node. The selected sequence mapping program is executed independenty to map the reads assigned to each node, and finally, the output files are combined and written to a desired location. pMap consists of three subprograms corresponding to different steps of the parallel mapping process: pmap_index, pmap_dist and pmap.

Download

Latest release: pMap-11-25-2010.tar.gz

Compiling

After extracting the archive, change the default compilers mpicc and gcc in the Makefile, if desired. Then, run make to create three executables: pmap_index, pmap_dist and pmap. Make sure that the mapping programs to be used with pMap are included in the $PATH.

Usage


pmap_index

pmap_index <genomefile> <indexdir> <indexprefix> <program> [programpars]

genomefile: file that contains the genome/chromosome sequences in FASTA format.
indexdir: directory in which output index files will be written
indexprefix: the prefix of the index files in indexdir
program: the sequence mapping program to be called (bowtie, bwa or soap)
programpars: additional parameters to be passed to program

When pmap_index is called:

* Example: Generate index files for Bowtie
#!/bin/bash
genomefile=/home/myuser/mygenome.fa
indexdir=/home/myuser/myindexdir
indexprefix=myindexprefix
pmap_index $genomefile $indexdir $indexprefix bowtie

#Sequential equivalent is:
bowtie-build $genomefile $indexdir/$indexprefix

* Example: Generate index files for BWA with "-a bwtsw" option
#!/bin/bash
genomefile=/home/myuser/mygenome.fa
indexdir=/home/myuser/myindexdir
indexprefix=myindexprefix
pmap_index $genomefile $indexdir $indexprefix bwa -a bwtsw

#Sequential equivalent is:
bwa index -a bwtsw -p $indexdir/$indexprefix $genomefile


pmap_dist

pmap_dist <workdir> <outdir> <readsfile> [-r readsfile2] [-l num_lines_per_read] [-i indexdir indexprefix] [-f inpfile]

workdir: local directory in which intermediate files will be written
outdir: directory on the head node in which output files will be written
readsfile: the reads file on the head node that contains the reads to be mapped
readsfile2: the second reads file on the head node needed when mapping paired end reads
num_lines_per_read: number of lines per read in readsfile and readsfile2
indexdir: shared directory in which index files are stored
indexprefix: the prefix of the index files in indexdir
inpfile: an input file on the head node that contains additional information needed during mapping

When pmap_dist is called:

Please also see pmap examples below.


pmap

pmap [options] <workdir> <outdir> <program> [programpars] [-conversion <convpars>]

options: pmap has the following optional parameters:
      -pe: paired end mapping. readsfile2 must have been distributed by pmap_dist if needed by the program.
-dc: don't cleanup intermediate files in workdir after the execution of pmap completes.
-i <indexdir> <indexprefix>: use index files from a shared directory. The index files are in indexdir and file names start with indexprefix. If -i is not specified, index files must have been replicated using -i option of pmap_dist

workdir: local work directory created by pmap_dist
outdir: directory on the head node in which output files will be written
program: the sequence mapping program to be called (bowtie, bwa, soap, gsnap, maq or rmap)
programpars: additional parameters to be passed to program. If programpars includes an output file, the output file name should be specified in outfile:filename format. If programpars includes an input file which is not to be read from a shared location, the input file should be replicated using the -f <inpfile> option of pmap_dist. Then, in programpars, the input file name should be specified in infile:filename format. For both input and output files, filename is just the name of the file, not a path.
convpars: additional paramaters of the conversion program, if any, that is used to convert intermediate output files from program-native format to human-readable format. [-conversion <convpars>] should be the last parameters of pmap

When pmap is called:

* Examples for running of pmap_dist and pmap in a PBS environment are given below. Commands for executing equivalent runs sequentially are also given in program specific examples. In all of the following examples assume that 4 nodes with 8 processors each are used for pMap and the following variables are set:
workdir=/temp
outdir=/home/myuser
readsfile=/home/myuser/reads.fq
readsfile2=/home/myuser/reads2.fq #if needed
indexdir=/home/myuser/myindexdir
indexprefix=myindexprefix

* Example: Map single end reads using Bowtie. Index files are at a shared location.
#PBS -l nodes=4:ppn=8
mpiexec -n 32 pmap_dist $workdir $outdir $readsfile
mpiexec -n 32 pmap -i $indexdir $indexprefix $workdir $outdir bowtie

* Example: Map paired end reads using BWA. Index files are replicated.
#PBS -l nodes=4:ppn=8
mpiexec -n 32 pmap_dist $workdir $outdir $readsfile -r $readsfile2 -i $indexdir $indexprefix
mpiexec -n 32 pmap -pe $workdir $outdir bwa

* Example: Map paired end reads using SOAP. Index files are replicated. Use -p 8 option of soap to use 8 threads per node.
#PBS -l nodes=4:ppn=8
uniq $PBS_NODEFILE > $ids # Get ids of the nodes and run a single instance with 8 threads on each node
mpiexec -n 4 -machinefile $ids pmap_dist $workdir $outdir $readsfile -r $readsfile2 -i $indexdir $indexprefix
mpiexec -n 4 -machinefile $ids pmap -pe $workdir $outdir soap -p 8 -2 outfile:unpaired.txt

Please also note the following for pmap_dist and pmap:


Using pMap with BWA

When pmap is called for BWA program, bwa aln is executed with appropriate parameters on each node to map the group of reads assigned to the node by pmap_dist. Subsequently, the generated output files in BWA-native ".sai" format are converted to SAM format by executing bwa samse for single end reads and bwa sampe for paired end reads. The optional parameters of bwa sampe or bwa samse can be provided to pmap by using the -conversion option of pmap.

* Example: Map paired end reads. Index files are replicated. Use -l 28 option with bwa aln and -a 400 option with bwa sampe.
#PBS -l nodes=4:ppn=8
mpiexec -n 32 pmap_dist $workdir $outdir $readsfile -r $readsfile2 -i $indexdir $indexprefix
mpiexec -n 32 pmap -pe $workdir $outdir bwa -l 28 -conversion -a 400

#Sequential equivalent is:
bwa aln -l 28 $indexdir/$indexprefix $readsfile > $workdir/out.sai
bwa aln -l 28 $indexdir/$indexprefix $readsfile2 > $workdir/out2.sai
bwa sampe -a 400 $indexdir/$indexprefix $workdir/out.sai $workdir/out2.sai $readsfile $readsfile2 > $outdir/out.txt

For more information on BWA: http://bio-bwa.sourceforge.net/
(pMap was tested using version 0.5.0 of BWA)


Using pMap with SOAP

When pmap is called for SOAP program, soap is executed with -D option and appropriate parameters on each node to map the group of reads assigned to the node by pmap_dist.

-a,-b,-D,-o options of SOAP are automatically used within pmap and therefore should not be specified in programpars.

* Example: Map paired end reads. Index files are replicated. Write unpaired reads into unpaired.txt using -2 option of soap.
#PBS -l nodes=4:ppn=8
mpiexec -n 32 pmap_dist $workdir $outdir $readsfile -r $readsfile2 -i $indexdir $indexprefix
mpiexec -n 32 pmap -pe $workdir $outdir soap -2 outfile:unpaired.txt

#Sequential equivalent is:
soap -D $indexdir/$indexprefix -a $readsfile -b $readsfile2 -o $outdir/out.txt -2 $outdir/unpaired.txt

For more information on SOAP: http://soap.genomics.org.cn/
(pMap was tested using version 2.20 of SOAP)
Note: Not using "-2 <str>" option when mapping paired end reads generates an error in SOAP.


Using pMap with Bowtie

When pmap is called for Bowtie program, bowtie is executed with appropriate parameters on each node to map the group of reads assigned to the node by pmap_dist. The BOWTIE_INDEXES environment variable must be set to the folder name that contains the index files (e.g. the following line should be present in .bashrc file: BOWTIE_INDEXES=/home/myuser/myindexdir; export BOWTIE_INDEXES. Also see Bowtie manual). Currently, using Bowtie with replicated index files (i.e. -i option of pmap_dist) is not supported by pMap. Therefore it should be used with shared index files (i.e. using -i option of pmap).

--12,-c,-s,-u,-b,--refout,--mm,-f (with ".mfa" files) options of Bowtie are not supported by pmap.
-1,-2 options of Bowtie are automatically used within pmap and therefore should not be specified in programpars.

* Example: Map paired end reads first by using default options, then by using the --best option of bowtie. Using the -dc option of pmap in the first run allows reusing the files distributed/replicated by pmap_dist in the second run. Index files are at a shared location.
#PBS -l nodes=4:ppn=8
outdir1=/home/myuser/myoutdir1 #Write output of first run here
outdir2=/home/myuser/myoutdir2 #Write output of second run here
mpiexec -n 32 pmap_dist $workdir $outdir $readsfile -r $readsfile2
mpiexec -n 32 pmap -pe -dc -i $indexdir $indexprefix $workdir $outdir1 bowtie
mpiexec -n 32 pmap -pe -i $indexdir $indexprefix $workdir $outdir2 bowtie --best

#Sequential equivalent is:
outdir1=/home/myuser/myoutdir1 #Write output of first run here
outdir2=/home/myuser/myoutdir2 #Write output of second run here
bowtie $indexprefix -1 $readsfile -2 $readsfile2 $outdir1/out.txt
bowtie --best $indexprefix -1 $readsfile -2 $readsfile2 $outdir2/out.txt

For more information on Bowtie: http://bowtie-bio.sourceforge.net/
(pMap was tested using version 0.10.1 of Bowtie)


Using pMap with GSNAP

When pmap is called for GSNAP program, gsnap is executed with appropriate parameters on each node to map the group of reads assigned to the node by pmap_dist. When using pMap with GSNAP, the indexdir should be set to the database directory that corresponds to the genome being considered. Also, indexprefix should be set to the name of the database. GSNAP uses a single input file when mapping paired end reads, therefore readsfile2 paramater is not needed.

-D,-d options of GSNAP are automatically used within pmap and therefore should not be specified in programpars.

* Example: Map paired end reads. The readsfile is in a special format that contains 3 lines per read. Use known SNP information in /home/myuser/snp.iit file using -V option of gsnap. Replicate the SNP information file as well as the index files (database files).
#PBS -l nodes=4:ppn=8
indexdir=/home/myuser/mygenomedb
indexprefix=mygenomedb
inpfile=/home/myuser/snp.iit
mpiexec -n 32 pmap_dist $workdir $outdir $readsfile -l 3 -i $indexdir $indexprefix -f $inpfile
mpiexec -n 32 pmap -pe $workdir $outdir gsnap -V infile:snp.iit

#Sequential equivalent is:
indexdir=/home/myuser/mygenomedb
indexprefix=mygenomedb
inpfile=/home/myuser/snp.iit
gsnap -V $inpfile -D $indexdir -d $indexprefix $readsfile > $outdir/out.txt

For more information on GSNAP: http://research-pub.gene.com/gmap/
(pMap was tested using version 2010-07-27 of GMAP/GSNAP)


Using pMap with MAQ

When pmap is called for MAQ program, first maq fastq2bfa and maq fastq2bfq are called to convert the input genome and the group of reads assigned to the node to an maq-internal format (optional [-n nreads] parameter of maq fastq2bfa is ignored). Then, maq map is executed with appropriate parameters on each node to map the reads. Subsequently, the generated output files in MAQ-native ".map" format are converted to a human-readable format by executing maq mapview. The optional parameters of maq mapview can be provided to pmap by using the -conversion option of pmap. Since there's no distinct indexing step in MAQ, the indexdir and indexprefix paramaters of pmap_dist and pmap should be set to the directory of the genome file and the genome file name respectively.

* Example: Map paired end reads. Index file (genome file) is replicated.
#PBS -l nodes=4:ppn=8
indexdir=/home/myuser
indexprefix=mygenome.fa
mpiexec -n 32 pmap_dist $workdir $outdir $readsfile -i $indexdir $indexprefix
mpiexec -n 32 pmap -pe $workdir $outdir maq

#Sequential equivalent is:
indexdir=/home/myuser
indexprefix=mygenome.fa
maq fasta2bfa $indexdir/$indexprefix $workdir/index.bfa
maq fastq2bfq $readsfile $workdir/reads.bfq
maq fastq2bfq $readsfile2 $workdir/reads2.bfq
maq map $workdir/out.map $workdir/index.bfa $workdir/reads.bfq $workdir/reads2.bfq
maq mapview $workdir/out.map > $outdir/out.txt

For more information on MAQ: http://maq.sourceforge.net/
(pMap was tested using version 0.7.1 of MAQ)


Using pMap with RMAP

When pmap is called for RMAP program, rmap is executed with -Q option and appropriate parameters on each node to map the group of reads assigned to the node by pmap_dist. Since there's no distinct indexing step in RMAP, the indexdir and indexprefix paramaters of pmap_dist and pmap should be set to the directory of the genome file and the genome file name respectively. The genome should be stored in a file in FASTA format, which may contain one or multiple chromosomes.

-s,-F,-p options of RMAP are not supported by pmap.
-c,-Q,-o options of RMAP are automatically used within pmap and therefore should not be specified in programpars.

* Example: Map paired end reads. Index file (genome file) is replicated.
#PBS -l nodes=4:ppn=8
indexdir=/home/myuser
indexprefix=mygenome.fa
mpiexec -n 32 pmap_dist $workdir $outdir $readsfile -i $indexdir $indexprefix
mpiexec -n 32 pmap -pe $workdir $outdir rmap

#Sequential equivalent is:
indexdir=/home/myuser
indexprefix=mygenome.fa
rmappe -c $indexdir/$indexprefix -Q $readsfile -o $outdir/out.txt

For more information on RMAP: http://rulai.cshl.edu/rmap/
(pMap was tested using version 2.05 of RMAP)