STARsolo: mapping, demultiplexing and quantification for single cell RNA-seq

Major update in STAR 2.7.9a (2021/05/05)

Major updates in STAR 2.7.8a (2021/02/20)

STARsolo

STARsolo is a turnkey solution for analyzing droplet single cell RNA sequencing data (e.g. 10X Genomics Chromium System) built directly into STAR code. STARsolo inputs the raw FASTQ reads files, and performs the following operations * error correction and demultiplexing of cell barcodes using user-input whitelist * mapping the reads to the reference genome using the standard STAR spliced read alignment algorithm * error correction and collapsing (deduplication) of Unique Molecular Identifiers (UMIa) * quantification of per-cell gene expression by counting the number of reads per gene * quantification of other transcriptomic features: splice junctions; pre-mRNA; spliced/unspliced reads similar to Velocyto

STARsolo output is designed to be a drop-in replacement for 10X CellRanger gene quantification output. It follows CellRanger logic for cell barcode whitelisting and UMI deduplication, and produces nearly identical gene counts in the same format. At the same time STARsolo is ~10 times faster than the CellRanger.

Running STARsolo for 10X Chromium scRNA-seq data

Matching CellRanger 3.x.x results

Matching CellRanger 4.x.x and 5.x.x results

Barcode geometry

Simple barcodes

Simple barcode lengths and start positions on barcode reads are described with

--soloCBstart, --soloCBlen, --soloUMIstart, --soloUMIlen

which works with

--soloType CB_UMI_Simple (a.k.a Droplet)

Barcode and cDNA on the same mate

By default, it is assumed that the barcode is located on one of the mates of paired-end read, while cDNA is on the other mate. However, in some scRNA-seq protocols the barcode and cDNA sequences are located on the same mate. In this case, we can specify the mate on which the barcode is located (1 or 2) with --soloBarcodeMate. Also, the barcode/adapter sequences have to be clipped (leaving only cDNA) with --clip5pNbases or --clip3pNbases.

For instance, for the 10X 5’ protocol, the 1st mate contains the barcode at the 5’, with 16b CB, 10b UMI and 13b adapter (39b total). If the 1st mate is sequenced longer than 39b, the remaining bases are cDNA that can be mapped together with the 2nd mate (which contains only cDNA):

--soloBarcodeMate 1   --clip5pNbases 39 0
--soloType CB_UMI_Simple   --soloCBstart 1   --soloCBlen 16   --soloUMIstart 17   --soloUMIlen 10
--readFilesIn read1.fq read2.fq

Complex barcodes

More complex barcodes are activated with --soloType CB_UMI_Complex and are described with the following parameters

soloCBposition              -
strings(s)              position of Cell Barcode(s) on the barcode read.
                        Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2.
                        Format for each barcode: startAnchor_startDistance_endAnchor_endDistance
                        start(end)Anchor defines the anchor base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end
                        start(end)Distance is the distance from the CB start(end) to the Anchor base
                        String for different barcodes are separated by space.
                        Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
                        --soloCBposition  0_0_2_-1  3_1_3_8

soloUMIposition             -
string                  position of the UMI on the barcode read, same as soloCBposition
                        Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
                        --soloUMIposition  3_9_3_14

soloAdapterSequence         -
string:                 adapter sequence to anchor barcodes.

soloAdapterMismatchesNmax   1
int>0:                  maximum number of mismatches allowed in adapter sequence
Cell filtering (calling)

In addition to raw, unfiltered output of gene/cell counts, STARsolo performs cell filtering (a.k.a. cell calling), which aims to select a subset of cells that are likely to be “real” cells as opposed to empty droplets (containing ambient RNA). Two types of filtering are presently implemented: simple (knee-like) and advanced EmptyDrop-like. The selected filtering is also used to produce summary statistics for filtered cells in the Summary.csv file, which is similar to CellRanger’s summary and is useful for Quality Control.

Knee filtering

Knee filtering is similar to the method used by CellRanger 2.2.x. This is turned on by default and is controlled by:

--soloCellFilter  CellRanger2.2

You can also add three numbers for this option (default values are given in parenthesis): the number of expected cells (3000), robust maximum percentile for UMI count (0.99), maximum to minimum ratio for UMI count (10).

EmptyDrop-like filtering

CellRanger 3.0.0 use advanced filtering based on the EmptyDrop algorithm developed by Lun et al. This algorithm calls extra cells compared to the knee filtering, allowing for cells that have relatively fewer UMIs but are transcriptionally different from the ambient RNA. In STARsolo, this filtering can be activated by:

--soloCellFilter  EmptyDrops_CR

It can be followed by 10 numeric parameters: nExpectedCells (3000), maxPercentile (0.99), maxMinRatio (10), indMin (45000), indMax (90000), umiMin (500), umiMinFracMedian (0.01), candMaxN (20000), FDR (0.01), simN (10000).

Cell filtering of previously generated raw matrix

It is possible to run only the filtering algorithm (without the need to re-map) inputting the previously generated raw matrix:

STAR --runMode soloCellFiltering  /path/to/count/dir/raw/   /path/to/output/prefix   --soloCellFilter EmptyDrops_CR

The /path/to/count/dir/raw/ directory should contain the “raw” barcodes.tsv, features.tsv, and matrix.mtx files generated in a previos STARsolo run. The output will contain the filtered files.

Quantification of different transcriptomic features
Multi-gene reads
Multi-gene reads are concordant with (i.e. align equally well to) transcripts of two or more genes. One class of multi-gene read are those that map uniquely to a genomic region where two or more genes overlap. Another class are those reads that map to multiple loci in the genome, with each locus annotated to a different gene.
Including multi-gene reads allows for more accurate gene quantification and, more importantly, enables detection of gene expression from certain classes of genes that are supported only by multi-gene reads, such as overlapping genes and highly similar paralog families.
The multi-gene read recovery options are specified with --soloMultiMappers. Several algorithms are implemented:
--soloMultiMappers Uniform uniformly distributes the multi-gene UMIs to all genes in its gene set. Each gene gets a fractional count of 1/N_genes, where N_genes is the number of genes in the set. This is the simplest possible option, and it offers higher sensitivity for gene detection at the expense of lower precision.
--soloMultiMappers Uniform distributes the multi-gene UMIs proportionally to the number of unqiue UMIs per gene. UMIs that map to genes that are not supported by unique UMIs are distributed uniformly.
--soloMultiMappers EM uses Maximum Likelihood Estimation (MLE) to distribute multi-gene UMIs among their genes, taking into account other UMIs (both unique- and multi-gene) from the same cell (i.e. with the same CB). Expectation-Maximization (EM) algorithm is used to find the gene expression values that maximize the likelihood function. Recovering multi-gene reads via MLE-EM model was previously used to quantify transposable elements in bulk RNA-seq {TEtranscripts} and in scRNA-seq {Alevin; Kallisto-bustools}.
--soloMultiMappers Rescue distributes multi-gene UMIs to their gene set proportionally to the sum of the number of unique-gene UMIs and uniformly distributed multi-gene UMIs in each gene Mortazavi et al. It can be thought of as the first step of the EM algorithm.
Any combination of these options can be specified and different multi-gene falvors will be output into different files. The unique-gene UMI counts are output into the matrix.mtx file in the raw/Gene directory, while the sum of unique+multi-gene UMI counts will be output into UniqueAndMult-EM.mtx, UniqueAndMult-PropUnique.mtx, UniqueAndMult-Rescue.mtx, UniqueAndMult-Uniform.mtx files.

BAM tags

Input reads from BAM files.

The read sequences and barcodes can be input from SAM (or BAM) files, both unmapped (uBAM) and previously mapped (e.g. Cellranger’s BAM):

--readFilesIn input.bam --readFilesType SAM SE

In case of BAM files, use samtools view command to convert to BAM:

--readFilesCommand samtools view -F 0x100

The file should contain one line for each read. For previously mapped file it can be achieved by filtering out non-primary alignments as shown above. Note that unmapped reads have to be included in the file to be remapped. We need to specify which SAM attributes correspond to seqeunces/qualities of cell barcodes (CR/CY) and UMIs (UR/UY):

--soloInputSAMattrBarcodeSeq CR UR    --soloInputSAMattrBarcodeQual CY UY

If you wish to omit some, All or None of the SAM attributes in the output BAM file (if you requested one), use --readFilesSAMattrKeep option. For previously mapped files, --readFilesSAMattrKeep None is often the best option to avoid duplicated SAM attributes in the BAM output.

If you request coordinate-sorted BAM output, and use a coordinate-sorted mapped BAM input (such as CellRanger’s possorted BAM), it may result in slow sorting and require large amountss of RAM. In this case, it is recommended to shuffle the alignments before mapping with samtools bamshuf command.

Different scRNA-seq technologies

Plate-based (Smart-seq) scRNA-seq

Plate-based (Smart-seq) scRNA-seq technologies produce separate FASTQ files for each cell. Cell barcodes are not incorporated in the read sequences, and there are no UMIs. Typical STAR command for mapping and quantification of these file will look like:

--soloType SmartSeq --readFilesManifest /path/to/manifest.tsv --soloUMIdedup Exact --soloStrand Unstranded
Read1-file-name \t Read2-file-name \t Cell-id

For single-end reads, the 2nd column should contain the dash - :

Read1-file-name \t - \t Cell-id

Cell-id can be any string without spaces. Cell-id will be added as ReadGroup tag (RG:Z:) for each read in the SAM/BAM output. If Cell-id starts with ID:, it can contain several fields separated by tab, and all the fields will be copied verbatim into SAM @RG header line. * Deduplication based on read start/end coordinates can be done with --soloUMIdedup Exact option. To avoid deduplication (e.g. for single-end reads) use --soloUMIdedup NoDedup. Both deduplication options can be used together --soloUMIdedup Exact NoDedup and will produce two .mtx matrices. * Common Smart-seq protocols are unstranded and thus will require --soloStrand Unstranded option. If your protocol is stranded, you can can choose the proper --soloStrand Forward (default) or --soloStrand Reverse options.

————————————————————-

All parameters that control STARsolo output are listed again below with defaults and short descriptions:
soloType                    None
    string(s): type of single-cell RNA-seq
                            CB_UMI_Simple   ... (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium.
                            CB_UMI_Complex  ... one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapters sequences are allowed in read2 only, e.g. inDrop.
                            CB_samTagOut    ... output Cell Barcode as CR and/or CB SAm tag. No UMI counting. --readFilesIn cDNA_read1 [cDNA_read2 if paired-end] CellBarcode_read . Requires --outSAMtype BAM Unsorted [and/or SortedByCoordinate]
                            SmartSeq        ... Smart-seq: each cell in a separate FASTQ (paired- or single-end), barcodes are corresponding read-groups, no UMI sequences, alignments deduplicated according to alignment start and end (after extending soft-clipped bases)

soloCBwhitelist             -
    string(s): file(s) with whitelist(s) of cell barcodes. Only --soloType CB_UMI_Complex allows more than one whitelist file.
                            None            ... no whitelist: all cell barcodes are allowed

soloCBstart                 1
    int>0: cell barcode start base

soloCBlen                   16
    int>0: cell barcode length

soloUMIstart                17
    int>0: UMI start base

soloUMIlen                  10
    int>0: UMI length

soloBarcodeReadLength       1
    int: length of the barcode read
                            1   ... equal to sum of soloCBlen+soloUMIlen
                            0   ... not defined, do not check

soloBarcodeMate             0                            
    int: identifies which read mate contains the barcode (CB+UMI) sequence
                            0   ... barcode sequence is on separate read, which should always be the last file in the --readFilesIn listed
                            1   ... barcode sequence is a part of mate 1
                            2   ... barcode sequence is a part of mate 2

soloCBposition              -
    strings(s)              position of Cell Barcode(s) on the barcode read.
                            Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2.
                            Format for each barcode: startAnchor_startPosition_endAnchor_endPosition
                            start(end)Anchor defines the Anchor Base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end
                            start(end)Position is the 0-based position with of the CB start(end) with respect to the Anchor Base
                            String for different barcodes are separated by space.
                            Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
                            --soloCBposition  0_0_2_-1  3_1_3_8

soloUMIposition             -
    string                  position of the UMI on the barcode read, same as soloCBposition
                            Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
                            --soloCBposition  3_9_3_14

soloAdapterSequence         -
    string:                 adapter sequence to anchor barcodes.

soloAdapterMismatchesNmax   1
    int>0:                  maximum number of mismatches allowed in adapter sequence.

soloCBmatchWLtype           1MM_multi
    string:                 matching the Cell Barcodes to the WhiteList
                            Exact                           ... only exact matches allowed
                            1MM                             ... only one match in whitelist with 1 mismatched base allowed. Allowed CBs have to have at least one read with exact match.
                            1MM_multi                       ... multiple matches in whitelist with 1 mismatched base allowed, posterior probability calculation is used choose one of the matches.
                                                                Allowed CBs have to have at least one read with exact match. This option matches best with CellRanger 2.2.0
                            1MM_multi_pseudocounts          ... same as 1MM_Multi, but pseudocounts of 1 are added to all whitelist barcodes.
                            1MM_multi_Nbase_pseudocounts    ... same as 1MM_multi_pseudocounts, multimatching to WL is allowed for CBs with N-bases. This option matches best with CellRanger >= 3.0.0


soloInputSAMattrBarcodeSeq  -
    string(s):              when inputting reads from a SAM file (--readsFileType SAM SE/PE), these SAM attributes mark the barcode sequence (in proper order).
                            For instance, for 10X CellRanger or STARsolo BAMs, use --soloInputSAMattrBarcodeSeq CR UR .
                            This parameter is required when running STARsolo with input from SAM.

soloInputSAMattrBarcodeQual  -
    string(s):              when inputting reads from a SAM file (--readsFileType SAM SE/PE), these SAM attributes mark the barcode qualities (in proper order).
                            For instance, for 10X CellRanger or STARsolo BAMs, use --soloInputSAMattrBarcodeQual CY UY .
                            If this parameter is '-' (default), the quality 'H' will be assigned to all bases.

soloStrand                  Forward
    string: strandedness of the solo libraries:
                            Unstranded  ... no strand information
                            Forward     ... read strand same as the original RNA molecule
                            Reverse     ... read strand opposite to the original RNA molecule

soloFeatures                Gene
    string(s):              genomic features for which the UMI counts per Cell Barcode are collected
                            Gene            ... genes: reads match the gene transcript
                            SJ              ... splice junctions: reported in SJ.out.tab
                            GeneFull        ... full genes: count all reads overlapping genes' exons and introns

soloUMIdedup                1MM_All
    string(s):              type of UMI deduplication (collapsing) algorithm
                            1MM_All                     ... all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once).
                            1MM_Directional_UMItools    ... follows the "directional" method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017).
                            1MM_Directional             ... same as 1MM_Directional_UMItools, but with more stringent criteria for duplicate UMIs
                            Exact                       ... only exactly matching UMIs are collapsed.
                            NoDedup                     ... no deduplication of UMIs, count all reads.
                            1MM_CR                      ... CellRanger2-4 algorithm for 1MM UMI collapsing.

soloUMIfiltering            -
    string(s)               type of UMI filtering
                            -               ... basic filtering: remove UMIs with N and homopolymers (similar to CellRanger 2.2.0).
                            MultiGeneUMI    ... basic + remove lower-count UMIs that map to more than one gene.
                            MultiGeneUMI_CR ... basic + remove lower-count UMIs that map to more than one gene, matching CellRanger > 3.0.0 .
                                                Only works with --soloUMIdedup 1MM_CR

soloOutFileNames            Solo.out/          features.tsv barcodes.tsv        matrix.mtx
    string(s)               file names for STARsolo output:
                            file_name_prefix   gene_names   barcode_sequences   cell_feature_count_matrix

soloCellFilter              CellRanger2.2 3000 0.99 10
    string(s):              cell filtering type and parameters
                            None            ... do not output filtered cells
                            TopCells        ... only report top cells by UMI count, followed by the exact number of cells
                            CellRanger2.2   ... simple filtering of CellRanger 2.2.
                                                Can be followed by numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count
                                                The harcoded values are from CellRanger: nExpectedCells=3000;  maxPercentile=0.99;  maxMinRatio=10
                            EmptyDrops_CR   ... EmptyDrops filtering in CellRanger flavor. Please cite the original EmptyDrops paper: A.T.L Lun et al, Genome Biology, 20, 63 (2019): https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1662-y
                                                Can be followed by 10 numeric parameters:  nExpectedCells   maxPercentile   maxMinRatio   indMin   indMax   umiMin   umiMinFracMedian   candMaxN   FDR   simN
                                                The harcoded values are from CellRanger:             3000            0.99            10    45000    90000      500               0.01      20000  0.01  10000

soloOutFormatFeaturesGeneField3 "Gene Expression"
    string(s):              field 3 in the Gene features.tsv file. If "-", then no 3rd field is output.