Randomly subsample sequencing reads to a specified coverage.
Hall, M. B., (2022). Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software, 7(69), 3941, https://doi.org/10.21105/joss.03941
I couldn’t find a tool for subsampling reads that met my
requirements. All the strategies I could find fell short as they either
just wanted a number or percentage of reads to subsample to or, if they
did subsample to a coverage, they assume all reads are the same size
(i.e Illumina). As I mostly work with long-read data this posed a
problem if I wanted to subsample a file to certain coverage, as length
of reads was never taken into account. rasusa
addresses
this shortcoming.
A workaround I had been using for a while was using filtlong
. It
was simple enough, I just figure out the number of bases I need to
achieve a (theoretical) coverage for my sample. Say I have a fastq from
an E. coli sample with 5 million reads and I want to subset it
to 50x coverage. I just need to multiply the expected size of the
sample’s genome, 4.6 million base pairs, by the coverage I want and I
have my target bases - 230 million base pairs. In filtlong
,
I can do the following
target=230000000
filtlong --target_bases "$target" reads.fq > reads.50x.fq
However, this is technically not the intended function of
filtlong
; it’s a quality filtering tool. What you get in
the end is a subset of the “highest
scoring” reads at a (theoretical) coverage of 50x. Depending on your
circumstances, this might be what you want. However, you bias yourself
towards the best/longest reads in the dataset - not a fair
representation of your dataset as a whole. There is also the possibility
of favouring regions of the genome that produce longer/higher quality
reads. De Maio et
al. even found that by randomly subsampling nanopore reads you
achieve better genome assemblies than if you had filtered.
So, depending on your circumstances, an unbiased subsample of your
reads might be what you need. And if this is the case,
rasusa
has you covered.
rasusa --input in.fq --coverage 30 --genome-size 4.6mb
The above command will output the subsampled file to
stdout
.
Or, if you have paired Illumina
rasusa -i r1.fq -i r2.fq --coverage 30 --genome-size 4g -o out.r1.fq -o out.r2.fq
For more details on the above options, and additional options, see below.
There are three required options to run rasusa
.
-i
, --input
This option specifies the file(s) containing the reads you would like
to subsample. The file(s) must be valid fasta or fastq format and can be
compressed (with a tool such as gzip
).
Illumina paired files can be passed in two ways. 1. Using
--input
twice -i r1.fq -i r2.fq
2. Using
--input
once, but passing both files immediately after
-i r1.fq r2.fq
Bash wizard tip 🧙: Let globs do the work for you
-i r*.fq
-c
, --coverage
Not required if
--bases
is present
This option is used to determine the minimum coverage to subsample the reads to. It can be specified as an integer (100), a decimal/float (100.7), or either of the previous suffixed with an ‘x’ (100x).
Note: Due to the method for determining how many bases are required to achieve the desired coverage, the actual coverage, in the end, could be slightly higher than requested. For example, if the last included read is very long. The log messages should inform you of the actual coverage in the end.
-g
,
--genome-size
Not required if
--bases
is present
The genome size of the input is also required. It is used to
determine how many bases are necessary to achieve the desired coverage.
This can, of course, be as precise or rough as you like.
Genome size can be passed in many ways. As a plain old integer (1600),
or with a metric suffix (1.6kb). All metric suffixes can have an
optional ‘b’ suffix and be lower, upper, or mixed case. So ‘Kb’, ‘kb’
and ‘k’ would all be inferred as ‘kilo’. Valid metric suffixes
include:
Alternatively, a FASTA/Q index file can be given and the genome size will be set to the sum of all reference sequences in it.
-o
, --output
NOTE: This parameter is required if passing paired Illumina data.
By default, rasusa
will output the subsampled file to
stdout
(if one file is given). If you would prefer to
specify an output file path, then use this option.
Output for Illumina paired files can be specified in the same manner
as --input
1. Using
--output
twice -o out.r1.fq -o out.r2.fq
2.
Using --output
once, but passing both files immediately
after -o out.r1.fq out.r2.fq
The ordering of the output files is assumed to be the same as the
input.
Note: The output will always be in the same format as the input. You
cannot pass fastq as input and ask for fasta as output.
rasusa
will also attempt to automatically infer whether
comression of the output file(s) is required. It does this by detecting
any of the supported extensions: - .gz
: will compress the
output with gzip
-
.bz
or .bz2
: will compress the output with bzip2
-
.lzma
: will compress the output with the xz
LZMA algorithm
-O
,
--output-type
Use this option to manually set the compression algoritm to use for the output file(s). It will override any format automatically detected from the output path.
Valid options are: - g
: gzip
- b
: bzip2
-
l
or x
: xz
LZMA algorithm -
z
: zstd
-
u
: no compression
Note: these options are case insensitive.
-l
,
--compress-level
Compression level to use if compressing the output. By default this is set to the default for the compression type being output.
-b
, --bases
Explicitly set the number of bases required in the subsample. This option takes the number in the same format as genome size.
Note: if this option is given, genome size and coverage are not required, or ignored if they are provided.
-n
, --num
Explicitly set the number of reads in the subsample. This option takes the number in the same format as genome size.
When providing paired reads as input, this option will sample this
many total read pairs. For example, when passing
-n 20 -i r1.fq r2.fq
, the two output files will have 20
reads each, and the read ids will be the same in both.
Note: if this option is given, genome size and coverage are not required.
-f
, --frac
Explicitly set the fraction of total reads in the subsample. The
value given to this option can be a float or a percentage - i.e.,
-f 0.5
and -f 50
will both take half of the
reads.
Note: if this option is given, genome size and coverage are not required.
-s
, --seed
This option allows you to specify the random seed used by the random subsampler. By explicitly setting this parameter, you make the subsample for the input reproducible. The seed is an integer, and by default it is not set, meaning the operating system will seed the random subsampler. You should only pass this parameter if you are likely to want to subsample the same input file again in the future and want the same subset of reads.
-v
Adding this optional flag will make the logging more verbose. By default, logging will produce messages considered “info” or above (see here for more details). If verbosity is switched on, you will additionally get “debug” level logging messages.
$ rasusa --help
Randomly subsample reads to a specified coverage
Usage: rasusa [OPTIONS] --input <INPUT>...
Options:
-i, --input <INPUT>...
The fast{a,q} file(s) to subsample.
For paired Illumina you may either pass this flag twice `-i r1.fq -i r2.fq` or give two files consecutively `-i r1.fq r2.fq`.
-o, --output <OUTPUT>...
Output filepath(s); stdout if not present.
For paired Illumina you may either pass this flag twice `-o o1.fq -o o2.fq` or give two files consecutively `-o o1.fq o2.fq`. NOTE: The order of the pairs is assumed to be the same as that given for --input. This option is required for paired input.
-g, --genome-size <size|faidx>
Genome size to calculate coverage with respect to. e.g., 4.3kb, 7Tb, 9000, 4.1MB
Alternatively, a FASTA/Q index file can be provided and the genome size will be set to the sum of all reference sequences.
If --bases is not provided, this option and --coverage are required
-c, --coverage <FLOAT>
The desired coverage to sub-sample the reads to
If --bases is not provided, this option and --genome-size are required
-b, --bases <bases>
Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB
If this option is given, --coverage and --genome-size are ignored
-n, --num <INT>
Subsample to a specific number of reads
If paired-end reads are passed, this is the number of (matched) reads from EACH file. This option accepts the same format as genome size - e.g., 1k will take 1000 reads
-f, --frac <FLOAT>
Subsample to a fraction of the reads - e.g., 0.5 samples half the reads
Values >1 and <=100 will be automatically converted - e.g., 25 => 0.25
-s, --seed <INT>
Random seed to use
-v
Switch on verbosity
-O, --output-type <u|b|g|l|x|z>
u: uncompressed; b: Bzip2; g: Gzip; l: Lzma; x: Xz (Lzma); z: Zstd
Rasusa will attempt to infer the output compression format automatically from the filename extension. This option is used to override that. If writing to stdout, the default is uncompressed
-l, --compress-level <1-21>
Compression level to use if compressing output. Uses the default level for the format if not specified
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
“Time flies like an arrow; fruit flies like a banana.”
― Anthony G. Oettinger
The real question is: will rasusa
just needlessly eat
away at your precious time on earth?
To do this benchmark, I am going to use hyperfine.
The data I used comes from
Download and rename the fastq
URL="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR649/008/SRR6490088/SRR6490088_1.fastq.gz"
wget "$URL" -O - | gzip -d -c > tb.fq
The file size is 2.9G, and it has 379,547 reads.
We benchmark against filtlong
using the same strategy
outlined in Motivation.
TB_GENOME_SIZE=4411532
COVG=50
TARGET_BASES=$(( TB_GENOME_SIZE * COVG ))
FILTLONG_CMD="filtlong --target_bases $TARGET_BASES tb.fq"
RASUSA_CMD="rasusa -i tb.fq -c $COVG -g $TB_GENOME_SIZE -s 1"
hyperfine --warmup 3 --runs 10 --export-markdown results-single.md \
"$FILTLONG_CMD" "$RASUSA_CMD"
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
filtlong --target_bases 220576600 tb.fq |
21.685 ± 0.055 | 21.622 | 21.787 | 21.77 ± 0.29 |
rasusa -i tb.fq -c 50 -g 4411532 -s 1 |
0.996 ± 0.013 | 0.983 | 1.023 | 1.00 |
Summary: rasusa
ran 21.77 ± 0.29 times
faster than filtlong
.
Download and then deinterleave the fastq with pyfastaq
URL="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR648/008/SRR6488968/SRR6488968.fastq.gz"
wget "$URL" -O - | gzip -d -c - | fastaq deinterleave - r1.fq r2.fq
Each file’s size is 179M and has 283,590 reads.
For this benchmark, we will use seqtk
. We will also
test seqtk
’s 2-pass mode as this is analogous to
rasusa
.
NUM_READS=140000
SEQTK_CMD_1="seqtk sample -s 1 r1.fq $NUM_READS > /tmp/r1.fq; seqtk sample -s 1 r2.fq $NUM_READS > /tmp/r2.fq;"
SEQTK_CMD_2="seqtk sample -2 -s 1 r1.fq $NUM_READS > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq $NUM_READS > /tmp/r2.fq;"
RASUSA_CMD="rasusa -i r1.fq r2.fq -n $NUM_READS -s 1 -o /tmp/r1.fq -o /tmp/r2.fq"
hyperfine --warmup 10 --runs 100 --export-markdown results-paired.md \
"$SEQTK_CMD_1" "$SEQTK_CMD_2" "$RASUSA_CMD"
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
seqtk sample -s 1 r1.fq 140000 > /tmp/r1.fq; seqtk sample -s 1 r2.fq 140000 > /tmp/r2.fq; |
907.7 ± 23.6 | 875.4 | 997.8 | 1.84 ± 0.62 |
seqtk sample -2 -s 1 r1.fq 140000 > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq 140000 > /tmp/r2.fq; |
870.8 ± 54.9 | 818.2 | 1219.8 | 1.77 ± 0.61 |
rasusa -i r1.fq r2.fq -n 140000 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq |
492.2 ± 165.4 | 327.4 | 887.4 | 1.00 |
Summary: rasusa
ran 1.84 times faster
than seqtk
(1-pass) and 1.77 times faster than
seqtk
(2-pass)
So, rasusa
is faster than seqtk
but doesn’t
require a fixed number of reads - allowing you to avoid doing maths to
determine how many reads you need to downsample to a specific coverage.
🤓
If you would like to help improve rasusa
you are very
welcome!
For changes to be accepted, they must pass the CI and coverage checks. These include:
rustfmt
. This can be done by
running cargo fmt
in the project directory.cargo clippy --all-features --all-targets -- -D warnings
kcov
.If you use rasusa
in your research, it would be very
much appreciated if you could cite it.
Hall, M. B., (2022). Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software, 7(69), 3941, https://doi.org/10.21105/joss.03941
@article{Hall2022,
doi = {10.21105/joss.03941},
url = {https://doi.org/10.21105/joss.03941},
year = {2022},
publisher = {The Open Journal},
volume = {7},
number = {69},
pages = {3941},
author = {Michael B. Hall},
title = {Rasusa: Randomly subsample sequencing reads to a specified coverage},
journal = {Journal of Open Source Software}
}