Rabema

M. Holtgrewe, A.-K. Emde, D. Weese and K. Reinert. A Novel And Well-Defined Benchmarking Method For Second Generation Read Mapping, BMC Bioinformatics 2011, 12:210, doi:10.1186/1471-2105-12-21. (Highly Accessed)

Abstract

Rabema is a program that supports a new read mapper benchmark methodology. The methodology is based on a strict definition of the read mapping problem and allows the evaluation of arbitrary read mapping programs that create SAM output.
The paper gives an explanation of the theory behind the benchmark methodology with an example evaluation.

Errata / Additional Online Material

The following files provide the additional material from the paper, errata for the paper and extended supplemental material.

Also see the README file and the Rabema online manual for more information.

Version History

  • current trunk to be included in the 1.1 final
    • Fixes to interval lowering/relabeling.
    • Breaking down intervals by error rate in all mode, too.
    • Updated manual.
  • May 22, 2012 v1.1beta Beta release of a vastyl improved Rabema (SVN revision 11898).
    • Greatly reduced memory usage. (a) If you can map it with RazerS 3, Rabema can build a gold standard with it. (b) The memory requirement for evaluation is slightly above 2x size of largest chromosome of your genome.
    • Rabema can now read BAM as well as SAM.
    • WIT has been renamed to GSI (gold standard intervals) which can be read gzip-compressed and is sorted by query name (i.e. read id).
    • Greatly improved documentation, see Online Manual, Command Line Usage, and Annotated Rabema Reports.
  • May 26, 2011 v1.0 Initial version for the Rabema paper.

Installing

Currently, the only way to install Rabema, is from source. Assuming you have Subversion, CMake make and a compiler installed under Linux/Mac Os X.

svn co http://svn.seqan.de/seqan/trunk seqan-trunk
mkdir seqan-trunk/build
cd seqan-trunk/build
cmake -DCMAKE_BUILD_TYPE=Release ..
make rabema_build_gold_standard rabema_evaluate \
  rabema_prepare_sam razers3
./bin/rabema_build_gold_standard --help
./bin/rabema_evaluate --help

For the Impatient (Getting Started)

After building the Rabema and Razers 3 programs (currently in "extras", soon to go to "core"), you can build the gold standard for one of the example datasets from the rabema-data.tar.bz file (see above).

We have to do the following: (1) Create a perfect mapping SAM file using RazerS 3. (2) Prepare SAM file (by replacing * for duplicative sequences by the actual sequence. (3) Obtain a copy of the file sorted by coordinate. Sorting is easiest done using samtools after conversion to BAM. (4) Build the gold standard using Rabema (rabema_build_gold_standard). Details on the parameters can be found in the Rabema manual.

./bin/razers3 --dont-shrink-alignments --verbose \
  --recognition-rate 100 \
  --percent-identity 92 \
  --max-hits 10000000 \
  --output out_gold.sam \
  rabema-data/data/saccharomyces/genome.fasta \
  rabema-data/data/saccharomyces/reads_454/SRR000853.10k.fastq
./bin/rabema_prepare_sam \
  -i out_gold.sam -o out_gold.prep.sam

Note that by default rabema_prepare_sam assumes that the input SAM file is sorted by query name in the same order as produced by samtools sort -n. If your file is not sorted in this way then rabema_prepare_sam will stop with a failing sanity check. In this case, call it with the --dont-check-sorting parameter like this: ./bin/rabema_prepare_sam -i out_gold.sam --dont-check-sorting -o out_gold.prep.sam.

samtools view -Sb out_gold.prep.sam >out_gold.bam
samtools sort out_gold.bam out_gold.by_coord
./bin/rabema_build_gold_standard \
  --max-error 8 \
  --distance-metric edit \
  --out-gsi gold_standard.gsi \
  --reference rabema-data/data/saccharomyces/genome.fasta \
  --in-bam out_gold.by_coord.bam

Next, compare the results of another read mapper against the gold standard. Here, we use the RazerS with default parameters. First, run RazerS.

./bin/razers3 -vv -o out_default.sam \
  rabema-data/data/saccharomyces/genome.fasta \
  rabema-data/data/saccharomyces/reads_454/SRR000853.10k.fastq

Then, compare the result with the gold standard. Note that the error rate used here is independent of the one built in the standard and only has to be less than or equal to the error rate used there. Let's have a look at the output in the category all.

./bin/rabema_evaluate \
  --max-error 8 \
  --distance-metric edit \
  --benchmark-category all \
  --reference rabema-data/data/saccharomyces/genome.fasta \
  --in-sam out_default.sam \
  --in-gsi gold_standard.gsi

At the end of the output, there will be the following report:

Intervals to find:              10839
Intervals found:                10839
Intervals found [%]             100
Invalid alignments:             312
Additional Hits:                0

Number of reads:                8840
Number of reads with intervals: 7723

Mapped reads:                   7723
Mapped reads [% of total]:      87.3643
Mapped reads [% of mappable]:   100

Normalized intervals found:     7723
Normalized intervals found [%]: 100

Found Intervals By Error Rate

  ERR       #max      #found      %found    norm max    norm found  norm found [%]
------------------------------------------------------------------------------------
    0       2780        2780      100.00     2095.80       2095.80      100.00
    1       4221        4221      100.00     3083.12       3083.12      100.00
    2       2721        2721      100.00     1849.09       1849.09      100.00
    3       1117        1117      100.00      694.99        694.99      100.00

In total, there were 10,839 intervals with up to 3% errors to be found and RazerS found all of them (the reads are very long so there is fewer room for ambiguity). The number of invalid alignments is 312. These alignments are caused by the defaults setting for allowed error rate of RazerS being 8%. The term "incorrect" is fixed to the given error rate in the benchmark. If we passed an error rate of 8% to the evaluation then there would be no invalid alignments. The number of additional hits is 0. This is the number of hits in the read mapper output with a valid error rate (below 3% in this case) that are not found in the gold standard. If this number is greater than zero then an error occured while building the gold standard or in the evaluation program. If you get such a number then please contact the Rabema authors.

The total number of reads is 8,840, the number of reads having an alignment with less than or equal to 3% error is 7,723 which also is the largest number of "normalized intervals" to be found. Each read contributes at most one point to the "normalized intervals score" achieved by a read mapper. Each interval for a read contributes 1/x points where x is the number of alignments for the read. A total of 7,723 reads could be mapped, which is 100% of all mappable reads and 87.4% of all reads.

The table below gives a further breakdown of the found intervals and normalized found intervals by error rate of the alignment. The data is broken down by error rate of the given alignment. For example, there were 2,780 of 2,780 intervals found with error rate 0, which amounts to 100% of such reads. There were 2,095.8 normalized such intervals of which all were found.

A detailed description of how to use Rabema can be found in Rabema Manual. The programs' command line interface is documented in Rabema Command Line Interface and Description of Rabema Reports contains an annotation of the reports generated by Rabema.

Last Update 29. October 2013