TruSPAdes 1.0 Manual

1. About truSPAdes
    1.1. Illumina TruSeq data
    1.2. TruSPAdes input
    1.3. TruSPAdes performance
    1.4. Installation
2. TruSPAdes input format
    2.1. Dataset file format
    2.2. Read files naming for automatic dataset file generation
3. Running truSPAdes
    3.1. TruSPAdes command line options
    3.2. TruSPAdes output
4. Citation
5. Feedback and bug reports

1. About truSPAdes

TruSPAdes is an assembler for short reads produced by Illumina TruSeq Long Read technology. TruSPAdes takes a collection of demultiplexed TruSeq reads as input and assembles them into long virtual reads. This manual will help you use truSPAdes.

1.1 Illumina TruSeq data

TruSeq Synthetic Long Reads technology is based on fragmenting genomic DNA into large segments (about 10 kb long) and forming random pools of the resulting segments (each pool contains about 300 segments). Next, these fragments are clonally amplified, sheared, and marked with a unique barcode. Afterwards, they are sequenced using standard Illumina short-read technology. All short reads originating from the same barcode are assembled together, resulting in a set of long contigs (this step is called TruSeq barcode assembly). Ideally, the result of such a sequencing effort for a single barcode is a collection of 300 fragments (each about 10 kb long) from a genome, forming 300 long virtual reads. Together, these segments are expected to cover about 3 million nucleotides (the barcode span). TruSPAdes is a tool for barcode assembly. It assembles each barcode separately and outputs the assembled contigs as virtual TruSeq Long Reads (TSLRs).
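
The barcode span quoted above follows directly from the pool size and the fragment length. A minimal check of that arithmetic, using the approximate figures from the text (the variable names are ours and purely illustrative):

```python
# Approximate figures quoted in the text above.
fragments_per_pool = 300       # segments in one barcode pool
fragment_length_bp = 10_000    # about 10 kb per segment

# Expected number of nucleotides covered by one barcode (barcode span).
barcode_span_bp = fragments_per_pool * fragment_length_bp
print(barcode_span_bp)  # 3000000
```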

1.2 TruSPAdes input

TruSPAdes requires demultiplexed sequencing data as input. Each barcode is assembled from one or several (if multiple lanes are available) libraries of paired-end reads in FASTQ format. TruSPAdes can handle both compressed and uncompressed reads. TruSPAdes uses a dataset file to describe a TruSeq dataset. The dataset file can be either created manually or generated automatically. Since TruSeq reads are provided as a service by Illumina, most TruSeq read files follow particular naming patterns, which truSPAdes uses to generate the dataset file automatically. See section 2 for details.

1.3 TruSPAdes performance

TruSPAdes assembles a standard Illumina TruSeq dataset (consisting of 384 barcodes) in 30 hours using 8 threads of an Intel Xeon 2.27 GHz processor and requires less than 16 GB of RAM (2 GB of RAM per thread).
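
For planning purposes, the per-thread figure above can be turned into a rough memory budget. The sketch below assumes that memory use grows linearly with the number of threads; that assumption is ours and is not a measured property of truSPAdes, and the function name is hypothetical.

```python
GB_PER_THREAD = 2  # per-thread figure quoted above


def estimated_ram_gb(threads, gb_per_thread=GB_PER_THREAD):
    """Rough RAM budget in GB, ASSUMING memory scales linearly with threads."""
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return threads * gb_per_thread


print(estimated_ram_gb(8))  # 16
```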

1.4 Installation

TruSPAdes comes as part of the SPAdes genome assembler. See the SPAdes manual for installation instructions.

2. TruSPAdes input format

TruSPAdes uses a dataset file to describe a TruSeq dataset. The dataset file can be either created manually or generated automatically. Below we describe the dataset file format and the conditions that enable automatic generation of dataset files.

2.1 Dataset file format

A dataset file describes a collection of read files. Each line in the dataset file describes the reads corresponding to a single barcode. The description should start with the barcode id (a string consisting of letters and digits) followed by the paths to the reads. For example, for a dataset with three barcodes the dataset file should look like this:


    barcodeId1 /FULL_PATH_TO_LEFT_READS1/LEFT_READS_FILE_NAME1.fastq.gz /FULL_PATH_TO_RIGHT_READS1/RIGHT_READS_FILE_NAME1.fastq.gz
    barcodeId2 /FULL_PATH_TO_LEFT_READS2/LEFT_READS_FILE_NAME2.fastq.gz /FULL_PATH_TO_RIGHT_READS2/RIGHT_READS_FILE_NAME2.fastq.gz
    barcodeId3 /FULL_PATH_TO_LEFT_READS3/LEFT_READS_FILE_NAME3.fastq.gz /FULL_PATH_TO_RIGHT_READS3/RIGHT_READS_FILE_NAME3.fastq.gz

If several lanes are available for a barcode, the reads for each lane should be provided on the same line. Reads from the same barcode should be written consecutively; e.g., if two lanes are available for a barcode, its description should look like this:


    barcodeId /LEFT_READS_PATH_L1.fastq.gz /RIGHT_READS_PATH_L1.fastq.gz /LEFT_READS_PATH_L2.fastq.gz /RIGHT_READS_PATH_L2.fastq.gz

A dataset file can be provided using the --dataset option (see section 3).
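
When a dataset file is created manually for many barcodes, it may be convenient to generate its lines with a short script. The following is a minimal sketch, not part of truSPAdes: the function name and the example paths are hypothetical, fields are separated by a single space as in the examples above, and paths are assumed not to contain whitespace.

```python
def dataset_line(barcode_id, read_pairs):
    """Format one dataset-file line.

    read_pairs is a list of (left_reads_path, right_reads_path) tuples,
    one tuple per lane, in the order they should appear on the line.
    """
    if not barcode_id.isalnum():
        raise ValueError("barcode id must consist of letters and digits")
    fields = [barcode_id]
    for left, right in read_pairs:
        fields += [left, right]
    return " ".join(fields)


# Hypothetical barcode with two lanes.
line = dataset_line(
    "barcodeId",
    [("/data/bc_L1_R1.fastq.gz", "/data/bc_L1_R2.fastq.gz"),
     ("/data/bc_L2_R1.fastq.gz", "/data/bc_L2_R2.fastq.gz")],
)
print(line)
```

Writing one such line per barcode to a text file gives a file in the format described in this section.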

2.2 Read files naming for automatic dataset file generation

Since TruSeq reads are provided as a service by Illumina, most TruSeq read files follow particular naming patterns that we use to generate dataset files automatically. If the read file names in your dataset satisfy the conditions described below, you can skip manual creation of the dataset file and use the --input-dir option instead (see section 3). The conditions that enable automatic dataset file generation are as follows:

All reads should be in one or several directories provided using the --input-dir option.
No other files should be present in these directories.
Names of files with left (right) reads should contain "R1" ("R2") as a substring.
Files with left and right reads of the same library should be in the same directory, and their names should differ only by the substitution of "R2" for "R1".
If several lanes are available, reads from the same lane can be marked with an "L<n>" label (i.e., the file name contains it as a substring), where <n> is the lane number.
Reads from the same barcode but from different lanes can be put into different directories, but their names should either coincide or differ only by the "L" markings.
All barcodes should have the same number of paired-end libraries (lanes). This makes it possible to check for missing or extra files in the input directories.

For example, a dataset with two lanes and two barcodes can be put into a single directory with the following file naming:


    dataset-directory:
        reads_L1_R1.fastq
        reads_L1_R2.fastq
        reads_L2_R1.fastq
        reads_L2_R2.fastq

Alternatively, the read files can be put into two directories:


    dataset_directory_L1:
        reads_R1.fastq
        reads_R2.fastq
    dataset_directory_L2:
        reads_R1.fastq
        reads_R2.fastq
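
Before relying on automatic generation, it can be useful to check that a set of file names satisfies the conditions listed above. The sketch below illustrates one way to do this; it is our own approximation and not the code truSPAdes uses. It pairs each "R1" file with its "R2" counterpart, groups the pairs by the file name with the "R1" and "L<n>" labels removed, and checks that every group has the same number of lanes. It assumes "/"-separated paths and that "R1" and "L<n>" do not occur elsewhere in the names.

```python
import re
from collections import defaultdict

LANE_LABEL = re.compile(r"L\d+")


def group_reads(file_paths):
    """Group (left, right) read-file pairs by a barcode key derived from names."""
    paths = set(file_paths)
    groups = defaultdict(list)
    for left in sorted(paths):
        head, sep, name = left.rpartition("/")
        if "R1" not in name:
            continue  # right reads are picked up via their left counterpart
        right = head + sep + name.replace("R1", "R2", 1)
        if right not in paths:
            raise ValueError("missing right reads for " + left)
        # Key: file name without the "R1" label and without any lane label.
        key = LANE_LABEL.sub("", name.replace("R1", "", 1))
        groups[key].append((left, right))
    if len({len(pairs) for pairs in groups.values()}) > 1:
        raise ValueError("barcodes have different numbers of lanes")
    return dict(groups)


# The single-directory example from this section.
single_dir = ["reads_L1_R1.fastq", "reads_L1_R2.fastq",
              "reads_L2_R1.fastq", "reads_L2_R2.fastq"]
print(group_reads(single_dir))
```

Applied to the two-directory layout, the same function groups both lanes under one key because the file names coincide.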

3. Running truSPAdes

3.1 TruSPAdes command line options

To run truSPAdes from the command line, type


    truspades.py [options] -o <output_dir>

Note that we assume that the truSPAdes installation directory has been added to the PATH variable (otherwise, provide the full path to the truSPAdes executable: <truspades installation dir>/truspades.py).

Basic options

-o <output_dir>
Specify the output directory. Required option.

-h (or --help)
Prints help.

-v (or --version)
Prints version.

--continue
Continues truSPAdes run from the specified output folder.

-t <int> (or --threads <int>)
Number of threads. The default value is 8.

Input data

--dataset <file_name>
Dataset file formatted as specified in section 2.1.

--input-dir <dir_name>
Directory containing reads. Note that the naming of read files should satisfy the conditions described in section 2.2.
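
When truSPAdes is launched from a pipeline script, the options above can be assembled into an argument list. The following is a minimal sketch: the function and its parameter names are hypothetical, only the options documented in this section are used, and the file and directory names in the example are placeholders.

```python
def truspades_command(output_dir, dataset=None, input_dir=None,
                      threads=None, resume=False):
    """Build an argument list for truspades.py from the documented options."""
    cmd = ["truspades.py"]
    if dataset is not None:
        cmd += ["--dataset", dataset]
    if input_dir is not None:
        cmd += ["--input-dir", input_dir]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if resume:
        cmd.append("--continue")
    cmd += ["-o", output_dir]  # required option
    return cmd


# Placeholder file and directory names.
cmd = truspades_command("truspades_out", dataset="my_dataset.txt", threads=8)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the Python standard library, provided truspades.py is on the PATH.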

3.2 TruSPAdes output

TruSPAdes stores all output files in <output_dir>, which is set by the user. The resulting TruSeq long reads are stored in <output_dir>/TSLRs.fastq. The full list of <output_dir> contents is presented below:

    TSLRs.fastq – resulting TruSeq long reads
    TSLRs.fasta – resulting TruSeq long reads in fasta format
    dataset.info – dataset file
    truspades.log – truSPAdes log file
    barcodes – directory containing output files for separate barcodes
    logs – directory containing log files for barcode assembly
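
A quick way to inspect the result is to list the lengths of the assembled long reads in TSLRs.fasta. The sketch below is a minimal FASTA reader written for illustration; it makes no assumption about the header format used by truSPAdes, and the record names in the example are invented.

```python
def fasta_lengths(lines):
    """Yield (header, sequence_length) for each record in FASTA-formatted lines."""
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)
    if header is not None:
        yield header, length


# Invented records; for real data, pass an open file object instead, e.g.
# open("<output_dir>/TSLRs.fasta") with <output_dir> replaced by your path.
example = [">read_a", "ACGTAC", "GT", ">read_b", "GGC"]
print(list(fasta_lengths(example)))  # [('read_a', 8), ('read_b', 3)]
```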

4. Citation

If you use truSPAdes in your research, please include Bankevich & Pevzner, 2016 in your reference list.

5. Feedback and bug reports

Your comments, bug reports, and suggestions are very welcome. They will help us to further improve truSPAdes.

If you have any trouble running truSPAdes, please send us the dataset.info and truspades.log files from the <output_dir> directory.

Address for communications: spades.support@cab.spbu.ru.