RNA-Bloom is a fast and memory-efficient de novo transcript sequence assembler. It is designed for the following sequencing data types:

* single-end/paired-end bulk RNA-seq (strand-specific/agnostic)
* paired-end single-cell RNA-seq (strand-specific/agnostic)
* long-read RNA-seq (ONT cDNA/direct RNA, PacBio cDNA)

Written by Ka Ming Nip :email:
:copyright: 2018-present Canada’s Michael Smith Genome Sciences Centre, BC Cancer
Java SE Development Kit (JDK) 11 (JDK 17 is slightly faster)
External software used:
software | short reads | long reads |
---|---|---|
minimap2 >=2.22 | required | required |
Racon | not used | required |
ntCard >=1.2.1 | required | required |
:warning: Input reads must be in either FASTQ or FASTA format and may be compressed with GZIP.
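
Before launching a long assembly, it can help to confirm that each input file really is FASTQ or FASTA. The sketch below is not part of RNA-Bloom: `read_format` is a hypothetical helper that looks only at the first byte of a file (`@` suggests FASTQ, `>` suggests FASTA) and tries GZIP decompression first. It does not validate the rest of the file.

```shell
# Guess the format of a read file from its first byte.
# Hypothetical helper for illustration; not an RNA-Bloom command.
read_format() {
    # Try to decompress; for a plain-text file gzip fails and prints nothing.
    first=$(gzip -cd "$1" 2>/dev/null | head -c 1) || true
    if [ -z "$first" ]; then
        # Not GZIP-compressed (or empty): read the first byte directly.
        first=$(head -c 1 "$1")
    fi
    case "$first" in
        '@') echo fastq ;;
        '>') echo fasta ;;
        *)   echo unknown ;;
    esac
}
```

For example, `read_format LEFT.fastq` is expected to print `fastq` for a well-formed FASTQ file, compressed or not.
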

- If `left` reads are sense and `right` reads are antisense, use `-revcomp-right` to reverse-complement `right` reads.
- If `left` reads are antisense and `right` reads are sense, use `-revcomp-left` to reverse-complement `left` reads.

Pass `-revcomp-right` or `-revcomp-left` accordingly:

java -jar RNA-Bloom.jar -left LEFT.fastq -right RIGHT.fastq -revcomp-right -t THREADS -outdir OUTDIR
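
For readers unfamiliar with the term, reverse-complementing a read means reversing the sequence and swapping each base with its complement (A with T, C with G). The sketch below only illustrates that operation on a single sequence string. It is not how RNA-Bloom implements `-revcomp-left` or `-revcomp-right`, and `revcomp` is a hypothetical helper name.

```shell
# Reverse-complement the DNA sequence given as the first argument.
# Bases other than A, C, G, T (for example N) are left unchanged.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}
```

For example, `revcomp AACG` prints `CGTT`.
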
Use `-sef` for forward reads and `-ser` for reverse reads:

java -jar RNA-Bloom.jar -sef SE.fastq -t THREADS -outdir OUTDIR
java -jar RNA-Bloom.jar -left LEFT.fastq -right RIGHT.fastq -revcomp-right -sef SE.fastq -t THREADS -outdir OUTDIR
file name | description |
---|---|
`rnabloom.transcripts.fa` | assembled transcripts longer than length threshold (default: 200) |
`rnabloom.transcripts.short.fa` | assembled transcripts shorter than length threshold |
`rnabloom.transcripts.nr.fa` | assembled transcripts with redundancy reduced |
java -jar RNA-Bloom.jar -pool READSLIST.txt -revcomp-right -t THREADS -outdir OUTDIR
This is especially useful for single-cell datasets. RNA-Bloom was tested on Smart-seq2 and SMARTer datasets. Pooled assembly is not supported for long-read data (`-long`) at this time.
The `-pool` option takes a tabular file that describes the read file paths for all cells/samples to be used in pooled assembly.

- Column header is on the first line, leading with `#`
- Columns are separated by space/tab characters
- Each sample can have more than one line; lines sharing the same `name` will be grouped together during assembly

column | description |
---|---|
`name` | sample name |
`left` | path to left read file |
`right` | path to right read file |
`sef` | path to single-end forward read file |
`ser` | path to single-end reverse read file |
Only `name`, `left`, and `right` columns are specified for a total of 3 columns. The legacy header-less tri-column format is still supported.
#name left right
cell1 /path/to/cell1/left.fastq /path/to/cell1/right.fastq
cell2 /path/to/cell2/left.fastq /path/to/cell2/right.fastq
cell3 /path/to/cell3/left.fastq /path/to/cell3/right.fastq
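
With many cells, writing this file by hand is tedious. The following sketch generates the 3-column format shown above. It assumes a hypothetical layout of one subdirectory per cell, each holding `left.fastq` and `right.fastq` (mirroring the example paths), and `make_reads_list` is an illustrative helper, not part of RNA-Bloom. Because columns are separated by whitespace, paths containing spaces would not work with this approach.

```shell
# Print a 3-column reads list for every subdirectory of $1 that contains
# both left.fastq and right.fastq (assumed, hypothetical layout).
make_reads_list() {
    printf '#name left right\n'
    for dir in "$1"/*/; do
        dir=${dir%/}                      # drop the trailing slash
        left="$dir/left.fastq"
        right="$dir/right.fastq"
        # Skip directories that lack either mate file.
        if [ -f "$left" ] && [ -f "$right" ]; then
            printf '%s %s %s\n' "$(basename "$dir")" "$left" "$right"
        fi
    done
}
```

A typical use would be `make_reads_list /path/to/cells > READSLIST.txt`, followed by the `-pool READSLIST.txt` command shown earlier.
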
In addition to `name`, `left`, and `right` columns, either `sef`, `ser`, or both are specified for a total of 4 to 5 columns.
#name left right sef ser
cell1 /path/to/cell1/left.fastq /path/to/cell1/right.fastq /path/to/cell1/sef.fastq /path/to/cell1/ser.fastq
cell2 /path/to/cell2/left.fastq /path/to/cell2/right.fastq /path/to/cell2/sef.fastq /path/to/cell2/ser.fastq
cell3 /path/to/cell3/left.fastq /path/to/cell3/right.fastq /path/to/cell3/sef.fastq /path/to/cell3/ser.fastq
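
A malformed reads list is easier to catch before a long run than after. The sketch below checks the rules described above for the headered format: the first line starts with `#`, there are 3 to 5 whitespace-separated columns, and every row has as many columns as the header. The last rule is an assumption of this sketch rather than a documented requirement, the legacy header-less format is deliberately not handled, and `check_reads_list` is a hypothetical helper, not an RNA-Bloom command.

```shell
# Validate a headered reads list; prints problems and returns non-zero
# if any are found. Assumes every row fills every header column.
check_reads_list() {
    awk '
        NR == 1 {
            if (substr($1, 1, 1) != "#") {
                print "line 1: header must start with #"; bad = 1
            }
            ncols = NF
            if (ncols < 3 || ncols > 5) {
                print "line 1: expected 3 to 5 columns, found " ncols; bad = 1
            }
            next
        }
        NF == 0 { next }    # ignore blank lines
        NF != ncols {
            print "line " NR ": expected " ncols " columns, found " NF; bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

For example, `check_reads_list READSLIST.txt && echo OK` prints `OK` only when no problem is reported.
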
file name | description |
---|---|
`rnabloom.transcripts.fa` | assembled transcripts longer than length threshold (default: 200) |
`rnabloom.transcripts.short.fa` | assembled transcripts shorter than length threshold |
`rnabloom.transcripts.nr.fa` | assembled transcripts with redundancy reduced |
java -jar RNA-Bloom.jar -stranded ...
The `-stranded` option indicates that input reads are strand-specific.

Strand-specific reads are typically in the F2R1 orientation, where `/2` denotes left reads in forward orientation and `/1` denotes right reads in reverse orientation.
Configure the read file paths accordingly for bulk RNA-seq data and indicate read orientation:
-stranded -left /path/to/reads_2.fastq -right /path/to/reads_1.fastq -revcomp-right
and for scRNA-seq data:
cell1 /path/to/cell1/reads_2.fastq /path/to/cell1/reads_1.fastq
java -jar RNA-Bloom.jar -ref TRANSCRIPTS.fasta ...
The `-ref` option specifies the reference transcriptome FASTA file for guiding short-read assembly. It is not supported for long-read data (`-long`) at this time.
:warning: It is strongly recommended to trim adapters in your reads before assembly; see, for example, Porechop.
Default presets for `-long` are intended for ONT data. Please add the `-lrpb` flag for PacBio data.
java -jar RNA-Bloom.jar -long LONG.fastq -t THREADS -outdir OUTDIR
Input reads are expected to be in a mix of both forward and reverse orientations.
Options `-pool` and `-ref` are not supported for long-read data at this time.
java -jar RNA-Bloom.jar -long LONG.fastq -stranded -t THREADS -outdir OUTDIR
Input reads are expected to be only in the forward orientation.
By default, uracil (`U`) is written as `T`. Use the `-uracil` option to write `U` instead of `T` in the output assembly.
ntCard v1.2.1 supports uracil in reads.
cDNA data:
java -jar RNA-Bloom.jar -long LONG.fastq -sef SHORT.fastq -t THREADS -outdir OUTDIR
direct RNA data:
java -jar RNA-Bloom.jar -stranded -long LONG.fastq -sef SHORT_FORWARD.fastq -ser SHORT_REVERSE.fastq -t THREADS -outdir OUTDIR
file name | description |
---|---|
`rnabloom.transcripts.fa` | assembled transcripts longer than minimum length threshold (default: 200) |
`rnabloom.transcripts.short.fa` | assembled transcripts shorter than minimum length threshold |
If `ntcard` is found in your `PATH`, then the `-ntcard` option is automatically turned on to count the number of unique k-mers in your reads.
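
To see in advance whether `ntcard` would be picked up, you can test your `PATH` yourself. This is a minimal sketch using the standard `command -v` builtin; `in_path` is a hypothetical helper name, and the check is only an approximation of whatever lookup RNA-Bloom performs internally (it would also match a shell function or alias named `ntcard`).

```shell
# Succeeds when the named program can be resolved by the shell.
in_path() {
    command -v "$1" >/dev/null 2>&1
}

if in_path ntcard; then
    echo "ntcard found: -ntcard should be turned on automatically"
else
    echo "ntcard not found: consider -nk or -mem instead"
fi
```
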
java -jar RNA-Bloom.jar -fpr 0.01 ...
This sets the size of Bloom filters automatically to accommodate a false positive rate (FPR) of ~1%.
Alternatively, you can specify the exact number of unique k-mers:
java -jar RNA-Bloom.jar -fpr 0.01 -nk 28077715 ...
This sets the size of Bloom filters automatically to accommodate 28,077,715 unique k-mers for an FPR of ~1%.
As a rule of thumb, a lower FPR may result in a better assembly but requires more memory for a larger Bloom filter.
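
To get a feel for this trade-off, the textbook sizing formula for a single Bloom filter with an optimal number of hash functions is m = -n ln(p) / (ln 2)^2 bits for n items at false positive rate p. The sketch below evaluates it with `awk`. It is only an illustration: RNA-Bloom manages several Bloom filters with its own parameters, so its actual memory use will differ, and `bloom_bits` is a hypothetical helper, not an RNA-Bloom command.

```shell
# Textbook size, in bits, of one Bloom filter holding n items at
# false positive rate p, assuming an optimal number of hash functions.
# Usage: bloom_bits N_ITEMS FPR
bloom_bits() {
    awk -v n="$1" -v p="$2" 'BEGIN { printf "%.0f\n", -n * log(p) / (log(2) ^ 2) }'
}
```

Under these assumptions, `bloom_bits 28077715 0.01` works out to roughly 9.6 bits per k-mer, or about 34 MB for one filter, while lowering the FPR to 0.001 raises this to roughly 14.4 bits per k-mer.
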
java -jar RNA-Bloom.jar -mem 10 ...
This sets the total size of Bloom filters to 10 GB. If none of `-nk`, `-ntcard`, or `-mem` is used, then the total size is configured based on the size of the input read files.
java -jar RNA-Bloom.jar -stage N ...
N | short reads | long reads |
---|---|---|
1 | construct graph | construct graph |
2 | assemble fragments | correct reads |
3 | assemble transcripts | assemble transcripts |
This is a very useful option if you only want to assemble fragments or correct long reads (i.e. with `-stage 2`)!
java -jar RNA-Bloom.jar -help
java -Xmx2g -jar RNA-Bloom.jar ...
or, if you installed with `conda`:
export JAVA_TOOL_OPTIONS="-Xmx2g"
rnabloom ...
This limits the maximum Java heap to 2 GB with the `-Xmx` option. Note that `java` options have no effect on Bloom filter sizes.
See documentation for other JVM options.
RNA-Bloom is written in Java with Apache NetBeans IDE. It uses the following libraries:

* Apache Commons CLI
* JGraphT
* Smile

If you use RNA-Bloom in your work, please cite our manuscript(s).
Ka Ming Nip, Saber Hafezqorani, Kristina K. Gagalova, Readman Chiu, Chen Yang, René L. Warren, and Inanc Birol. Reference-free assembly of long-read transcriptome sequencing data with RNA-Bloom2. bioRxiv. 2022.08.07.503110. doi: 10.1101/2022.08.07.503110
Ka Ming Nip, Readman Chiu, Chen Yang, Justin Chu, Hamid Mohamadi, René L. Warren, and Inanc Birol. RNA-Bloom enables reference-free and reference-guided sequence assembly for single-cell transcriptomes. Genome Research. 2020 Aug;30(8):1191-1200. doi: 10.1101/gr.260174.119. Epub 2020 Aug 17.