Alevin

Alevin is a tool --- integrated with the salmon software --- that introduces a family of algorithms for quantification and analysis of 3' tagged-end single-cell sequencing data. Currently alevin supports the following two major droplet based single-cell protocols:

  1. Drop-seq

  2. 10x-Chromium v1/2/3

Alevin works under the same indexing scheme (as salmon) for the reference, and consumes the set of FASTA/Q files(s) containing the Cellular Barcode(CB) + Unique Molecule identifier (UMI) in one read file and the read sequence in the other. Given just the transcriptome and the raw read files, alevin generates a cell-by-gene count matrix (in a fraction of the time compared to other tools).

Alevin works in two phases. In the first phase it quickly parses the read file containing the CB and UMI information to generate the frequency distribution of all the observed CBs, and creates a lightweight data-structure for fast-look up and correction of the CB. In the second round, alevin utilizes the read-sequences contained in the files to map the reads to the transcriptome, identify potential PCR/sequencing errors in the UMIs, and performs hybrid de-duplication while accounting for UMI collisions. Finally, a post-abundance estimation CB whitelisting procedure is done and a cell-by-gene count matrix is generated.

Using Alevin

Alevin requires the following minimal set of necessary input parameters (generally providing the flags in that order is recommended):

  • -l: library type (same as salmon), we recommend using ISR for both Drop-seq and 10x-v2 chemistry.

  • -1: CB+UMI file(s), alevin requires the path to the FASTQ file containing CB+UMI raw sequences to be given under this command line flag. Alevin also supports parsing of data from multiple files as long as the order is the same as in -2 flag.

  • -2: Read-sequence file(s), alevin requires the path to the FASTQ file containing raw read-sequences to be given under this command line flag. Alevin also supports parsing of data from multiple files as long as the order is the same as in -1 flag.

  • --dropseq / --chromium / --chromiumV3: the protocol, this flag tells the type of single-cell protocol of the input sequencing-library.

  • -i: index, file containing the salmon index of the reference transcriptome, as generated by salmon index command.

  • -p: number of threads, the number of threads which can be used by alevin to perform the quantification, by default alevin utilizes all the available threads in the system, although we recommend using ~10 threads which in our testing gave the best memory-time trade-off.

  • -o: output, path to folder where the output gene-count matrix (along with other meta-data) would be dumped.

  • --tgMap: transcript to gene map file, a tsv (tab-separated) file --- with no header, containing two columns mapping of each transcript present in the reference to the corresponding gene (the first column is a transcript and the second is the corresponding gene).

Once all the above requirement are satisfied, alevin can be run using the following command:

> salmon alevin -l ISR -1 cb.fastq.gz -2 reads.fastq.gz --chromium  -i salmon_index_directory -p 10 -o alevin_output --tgMap txp2gene.tsv

Providing multiple read files to Alevin

Often, a single library may be split into multiple FASTA/Q files. Also, sometimes one may wish to quantify multiple replicates or samples together, treating them as if they are one library. Alevin allows the user to provide a space-separated list of files to all of it's options that expect input files (i.e. -1, -2). The order of the files in the left and right lists must be the same. There are a number of ways to provide alevin with multiple CB and read files, and treat these as a single library. For the examples below, assume we have two replicates lib_A and lib_B. The left and right reads for lib_A are lib_A_cb.fq and lib_A_reads.fq, respectively. The left and right reads for lib_B are lib_B_cb.fq and lib_B_read.fq, respectively. The following are both valid ways to input these reads to alevin:

> salmon alevin -lISR -1 lib_A_cb.fq lib_B_cb.fq -2 lib_A_read.fq lib_B_read.fq

Similarly, both of these approaches can be adopted if the files are gzipped as well:

> salmon alevin -l ISR -1 lib_A_cb.fq.gz lib_B_cb.fq.gz -2 lib_A_read.fq.gz lib_B_read.fq.gz

Note

Don't provide data through input stream

To keep the time-memory trade-off within acceptable bounds, alevin performs multiple passes over the Cellular Barcode file. Alevin goes through the barcode file once by itself, and then goes through both the barcode and read files in unison to assign reads to cells using the initial barcode mapping. Since the pipe or the input stream can't be reset to read from the beginning again, alevin can't read in the barcodes, and might crash.

Description of important options

Alevin exposes a number of useful optional command-line parameters to the user. The particularly important ones are explained here, but you can always run salmon alevin -h to see them all.

-p / --numThreads

The number of threads that will be used for quantification. Alevin is designed to work well with many threads, so, if you have a sufficient number of processors, larger values here can speed up the run substantially. In our testing we found that usually 10 threads gives the best time-memory trade-off.

Note

Default number of threads

The default behavior is for Alevin to probe the number of available hardware threads and to use this number.

Thus, if you want to use fewer threads (e.g., if you are running multiple instances of Salmon simultaneously), you will likely want to set this option explicitly in accordance with the desired per-process resource usage.

--whitelist

This is an optional argument, where user can explicitly specify the whitelist CB to use for cell detection and CB sequence correction. If not given, alevin generates its own set of putative CBs.

Note

Not 10x 724k whitelist

This flag does not use the biologically known whitelist provided by 10x, instead it's per experiment level whitelist file e.g. the file generated by cellranger with the name barcodes.tsv.

--noQuant

Generally used in parallel with --dumpfq. If Alevin is passed the --noQuant option, the pipeline will stop before starting the mapping. The general use-case is when we only need to concatenate the CB on the read-id of the second file and break the execution afterwards.

--noDedup

If Alevin is passed the --noDedup option, the pipeline only performs CB correction, maps the read-sequences to the transcriptome generating the interim data-structure of CB-EqClass-UMI-count. Used in parallel with --dumpBarcodeEq or --dumpBfh for the purposes of obtaining raw information or debugging.

--mrna

The list of mitochondrial genes which are to be used as a feature for CB whitelising naive Bayes classification.

Note

It is generally advisable to not use nuclear mitrochondrial genes in this as they can be both up and/or down regulated which might cancel out the usefulness of this feature. Please check issue #367 in salmon repo to know more about it.

--rrna

The list of ribosomal genes which are to be used as a feature for CB whitelising naive Bayes classification.

--dumpfq

Generally used along with --noQuant. If activated, alevin will sequence correct the CB and attach the corrected CB sequence to the read-id in the second file and dumps the result to standard-out (stdout).

--dumpBfh

Alevin internally uses a potentially big data-structure to concisely maintain all the required information for quantification. This flags dumps the full CB-EqClass-UMI-count data-structure for the purposed of allowing raw data analysis and debugging.

--dumpFeatures

If activated, alevin dumps all the features used by the CB classification and their counts at each cell level. It's generally used in pair with other command line flags.

--dumpMtx

This flags is used to internally convert the default binary format of alevin for gene-count matrix into a human readable mtx (matrix market exchange) sparse format.

--forceCells

Alevin performs a heuristic based initial CB white-listing by finding the knee in the distribution of the CB frequency. Although knee finding algorithm works pretty well in most of the case, it sometimes over shoot and results in very less number of CB. With this flag, by looking at the CB frequency distribution, a user can explicitly specify the number of CB to consider for initial white-listing.

--expectCells

Just like forceCells flag, it's yet another way of skipping the knee calculation heuristics, if it's failing. This command line flag uses the cellranger type white-listing procedure. As specified in their algorithm overview page, "All barcodes whose total UMI counts exceed m/10 are called as cells", where m is the frequency of the top 1% cells as specified by the parameter of this command line flag.

--numCellBootstraps

Alevin provides an estimate of the inferential uncertainty in the estimation of per cell level gene count matrix by performing bootstrapping of the reads in per-cell level equivalence classes. This command line flag informs Alevin to perform certain number of bootstrap and generate the mean and variance of the count matrix. This option generates three additional file, namely, quants_mean_mat.gz, quants_var_mat.gz and quants_boot_rows.txt. The format of the files stay the same as quants_mat.gz while the row order is saved in quants_boot_rows.txt and the column order is stays the same as in file quants_mat_cols.txt.

Note

Alevin can also dump the full bootstrap cell-gene count matrix of a experiment. To generate inferential replicates of the experiemnt, --numCellBootstraps has to be paired with --dumpFeatures which generates a file with name quants_boot_mat.gz. The output format is the same as quants_mat.gz and we fit the 3D cube of the cell-inference-gene counts in 2D as follows: if an experiment has C cells, G genes and N inferential replicates; alevin output file quants_boot_mat.gz would contain C*N rows and G columns while, starting from the top, the first N rows would represent first cell and it's N inferential replicate. For more information on importing and using inferential replicates for single-cell data in generating accurate differential expression analysis, check out tximport and our Swish paper.

--debug

Alevin peforms intelligent white-listing downstream of the quantification pipeline and has to make some assumptions like chosing a fraction of reads to learn low confidence CB and in turn might erroneously exit -- if the data results in no mapped or deduplicated reads to a CB in low confidence region. The problem doesn’t happen when provided with external whitelist but if there is an error and the user is aware of this being just a warning, the error can be skipped by running Alevin with this flag.

--minScoreFraction

This value controls the minimum allowed score for a mapping to be considered valid. It matters only when --validateMappings has been passed to Salmon. The maximum possible score for a fragment is ms = read_len * ma (or ms = (left_read_len + right_read_len) * ma for paired-end reads). The argument to --minScoreFraction determines what fraction of the maximum score s a mapping must achieve to be potentially retained. For a minimum score fraction of f, only mappings with a score > f * s will be kept. Mappings with lower scores will be considered as low-quality, and will be discarded.

It is worth noting that mapping validation uses extension alignment. This means that the read need not map end-to-end. Instead, the score of the mapping will be the position along the alignment with the highest score. This is the score which must reach the fraction threshold for the read to be considered as valid.

Output

Typical 10x experiment can range form hundreds to tens of thousand of cells -- resulting in huge size of the count-matrices. Traditionally single-cell tools dumps the Cell-v-Gene count matrix in various formats. Although, this itself is an open area of research but by default alevin dumps a per-cell level gene-count matrix in a binary-compressed format with the row and column indexes in a separate file.

A typical run of alevin will generate 4 files:

  • quants_mat.gz -- Compressed count matrix.

  • quants_mat_cols.txt -- Column Header (Gene-ids) of the matrix.

  • quants_mat_rows.txt -- Row Index (CB-ids) of the matrix.

  • quants_tier_mat.gz -- Tier categorization of the matrix.

Along with the Cell-v-Gene count matrix, alevin dumps a 3-fold categorization of each estimated count value of a gene(each cell disjointly) in the form of tiers. Tier 1 is the set of genes where all the reads are uniquely mapping. Tier 2 is genes that have ambiguously mapping reads, but connected to unique read evidence as well, that can be used by the EM to resolve the multimapping reads. Tier 3 is the genes that have no unique evidence and the read counts are, therefore, distributed between these genes according to an uninformative prior.

Alevin can also dump the count-matrix in a human readable -- matrix-market-exchange (_mtx_) format, if given flag --dumpMtx which generates a new output file called quants_mat.mtx.

Output Quality Check

Alevin generated gene-count matrix can be visualized for various quality checks using alevinQC , a shiny based R package and it is actively supported by Charlotte Soneson.

Tutorial & Parsers

We have compiled a step-by-step resource to help get started with aleivn. We have tutorials on how to get input, run and generate output using alevin's framework which can be found here at Alevin Tutorials. The tutorial also covers the topic of integrating alevin with downstream analysis tools like Seurat and Monocle. If you are interested in parsing various output binary formats like quants_mat.gz, quants_tier_mat.gz, cell_umigraph.gz etc. of alevin in python, checkout our companion repo for python parsing. This repo is also available on pip and can be installed through pip install vpolo. We cover how to use this library on our alevin-tutorial website too.

Misc

Finally, the purpose of making this software available is because we believe it may be useful for people dealing with single-cell RNA-seq data. We want the software to be as useful, robust, and accurate as possible. So, if you have any feedback --- something useful to report, or just some interesting ideas or suggestions --- please contact us (asrivastava@cs.stonybrook.edu and/or rob.patro@cs.stonybrook.edu). If you encounter any bugs, please file a detailed bug report at the Salmon GitHub repository.

BibTex

@article{srivastava2019alevin,
title={Alevin efficiently estimates accurate gene abundances from dscRNA-seq data},
author={Srivastava, Avi and Malik, Laraib and Smith, Tom and Sudbery, Ian and Patro, Rob},
journal={Genome biology},
volume={20},
number={1},
pages={65},
year={2019},
publisher={BioMed Central}
}

References

1

Zhu, Anqi, et al. "Nonparametric expression analysis using inferential replicate counts." BioRxiv (2019): 561084.

2

Qiu, Xiaojie, et al. "Reversed graph embedding resolves complex single-cell trajectories." Nature methods 14.10 (2017): 979.

3

Butler, Andrew, et al. "Integrating single-cell transcriptomic data across different conditions, technologies, and species." Nature biotechnology 36.5 (2018): 411.

4

Macosko, Evan Z., et al. "Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets." Cell 161.5 (2015): 1202-1214.

5

Zheng, Grace XY, et al. "Massively parallel digital transcriptional profiling of single cells." Nature communications 8 (2017): 14049.

6

Patro, Rob, et al. "Salmon provides fast and bias-aware quantification of transcript expression." Nature Methods (2017). Advanced Online Publication. doi: 10.1038/nmeth.4197.

7

Petukhov, Viktor, et al. "Accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments." bioRxiv (2017): 171496.

8

https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/algorithms/overview