GPLv3 license DOI

ROCK

ROCK (Reducing Over-Covering K-mers) is a command line program written in C++ that runs an alternative implementation of the digital normalization method (e.g. Brown et al. 2012, Wedemeyer et al. 2017, Durai and Schulz 2019).

Given one or several FASTQ file(s), the main aim of ROCK is to build a subset of accurate high-throughput sequencing (HTS) reads such that the induced coverage depth is comprised between two specified lower- and upper-bounds. ROCK can therefore be used to reduce and/or homogenize the overall coverage depth within large sets of HTS reads, which is often required to quickly infer accurate de novo genome assemblies (e.g. Desai et al. 2013, Chen et al. 2015). ROCK can also be used to discard low-covering HTS reads, as those ones are often artefactual, highly erroneous or contaminating sequences.

For more details, see the associated publication: Legrand et al. (2022).

Usage

Run ROCK without option to read the following documentation:

Reducing Over-Covering K-mers within FASTQ file(s)

USAGE: rock [options] [files]

OPTIONS:
 -i <file>  file containing the name(s)  of the input FASTQ file(s) to
            process;  single-end: one file name per line;  paired-end:
            two file names  per line  separated by  a comma;  up to 15
            FASTQ file  names can  be specified;  of note,  input file
            name(s) can also be specified as program argument(s)
 -o <file>  file containing the  name(s) of the  output FASTQ file(s);
            FASTQ file name(s) should be structured in the same way as
            the file specified in option -i.
 -k <int>   k-mer length (default 25)
 -c <int>   lower-bound k-mer coverage depth threshold (default: 0)
 -C <int>   upper-bound k-mer coverage depth threshold (default: 70)
 -l <int>   number of hashing function(s) (default: 4)
 -n <int>   expected total number of  distinct k-mers within the input
            read sequences; not compatible with option -l.
 -f <float> maximum expected false positive probability when computing
            the optimal number of hashing functions from the number of
            distinct k-mers specified with option -n (default: 0.05).
 -q <int>   sets as valid  only k-mers  made  up  of  nucleotides with
            Phred score (+33 offset) above this cutoff (default: 0)
 -m <int>   minimum  number  of valid  k-mer(s)  to  consider  a  read 
            (default: 1)
 -v         verbose mode
 -h         prints this message and exit

Notes

Examples

Single FASTQ file

The following Bash command line enables to download and uncompress the FASTQ file (845 Mb) associated to the run accession ERR6807819:

wget -O - https://ftp.sra.ebi.ac.uk/vol1/fastq/ERR680/009/ERR6807819/ERR6807819.fastq.gz | zcat > ERR6807819.fastq

The downloaded FASTQ file ERR6807819.fastq contains 2,310,531 Ion Torrent S5 reads (394,782,864 bases) corresponding to the whole genome sequencing of a SARS-CoV-2 strain.

ROCK can be directly run on this FASTQ file to perform digital normalization using default options (e.g. k = 25, κ = 70, κ’  = 0):

rock  ERR6807819.fastq

After performing the digital normalization of these single-end reads (running time: ~1 minute), ROCK writes the default output FASTQ file ERR6807819.rock.fastq (93,810 HTS reads, 13,467,261 bases).

The default values of the upper- and lower-bound k-mer coverage depth cutoff κ and κ’ can be modified using options -C and -c, respectively:

rock  -c 5  -C 80  ERR6807819.fastq

The above command line leads to 76,523 HTS reads (11,608,269 bases).

As ERR6807819 is the result of the sequencing of a SARS-CoV-2 genome, it is expected that its number F0 of distinct canonical k-mers is far below 3 billions. In consequence, ROCK can be adequately run using only λ = 1 hashing function:

rock  -c 5  -C 80  -l 1  ERR6807819.fastq

The above command line leads to 76,526 HTS reads (11,607,360 bases).

Alternatively, F0 can be priorly estimated (e.g. with k = 25, ntCard returns F0 = 4,840,553), and next specified using option -n:

rock  -c 5  -C 80  -n 4840553  -v  ERR6807819.fastq

As the option -v has been set, the above command line prints the following information:

SE input file(s)                        ERR6807819.fastq
upper-bound coverage depth              80
lower-bound coverage depth              5
k-mer length                            25
expected no. distinct k-mers            4840553
no. hash. function(s)                   1
expected false positive proba.          0.000000 (cov. depth > 1)
no. buckets for hash. 1                 4283781797
no. bits per bucket                     8
count-min sketch size (Gb)              4
no. input reads                         2310531
no. input k-mers                        339330120
no. unset buckets for hash. 1           4280838158
approx. no. distinct k-mers             2944656
approx. false positive proba.           0.000000
no. selected reads                      76526
SE output files                         ERR6807819.rock.fastq

This shows that λ = 1 hashing function is indeed sufficient to deal with F0 = 4,840,553 distinct canonical k-mers. The selected HTS read subset is therefore identical to the one obtained using -l 1.

Valid k-mers (to fill the CMS) can also be assessed by specifying a minimum Phred score using option -q, e.g. Q ≥ 10:

rock  -c 5  -C 80  -n 4840553  -q 10  ERR6807819.fastq

The above command line leads to 44,687 HTS reads (7,367,710 bases).

Paired FASTQ files

The following Bash command lines enable to download and uncompress the two FASTQ files (199 Mb each) associated to the run accession ERR7756257:

EBIURL="https://ftp.sra.ebi.ac.uk/vol1/fastq";
wget -O - $EBIURL/ERR775/007/ERR7756257/ERR7756257_1.fastq.gz | gunzip -c > ERR7756257.1.fastq ;
wget -O - $EBIURL/ERR775/007/ERR7756257/ERR7756257_2.fastq.gz | gunzip -c > ERR7756257.2.fastq ;

The downloaded FASTQ files ERR7756257.1.fastq and ERR7756257.2.fastq contains 571,893 Illumina MiSeq PE reads (84,146,312 and 84,156,299 bases, respectively) corresponding to the whole genome sequencing of a SARS-CoV-2 strain.

ROCK can be directly run on these paired FASTQ files to perform digital normalization using default options (e.g. k = 25, κ = 70, κ’  = 0):

rock  ERR7756257.1.fastq,ERR7756257.2.fastq

After performing the digital normalization of these HTS PE reads (running time: ~30 seconds), ROCK writes the default output FASTQ files ERR7756257.1.rock.fastq and ERR7756257.2.rock.fastq (17,984 HTS PE reads, 2,633,631 and 2,636,701 bases, respectively).

Input and output FASTQ file names can be specified into txt files, e.g.

echo "ERR7756257.1.fastq,ERR7756257.2.fastq" > infiles.txt ;
echo "ERR7756257.norm.R1.fastq,ERR7756257.norm.R2.fastq" > outfiles.txt ;
rock  -i infiles.txt  -o outfiles.txt

The above command lines lead to the paired FASTQ output files ERR7756257.norm.R1.fastq and ERR7756257.norm.R2.fastq. PE FASTQ files should be specified in the same line, separated by a comma (,), without blank space. Up to 15 FASTQ files (paired or not) can be specified. Of important note, output FASTQ files should be specified in the same way and in the same order as the corresponding input FASTQ files.

Options described above (see Single FASTQ file) can also be used with paired FASTQ files:

rock  -c 5  -C 80  -n 2277975  -q 10  -v  -i infiles.txt  -o outfiles.txt

The above command line prints the following information:

PE input files                          ERR7756257.1.fastq ERR7756257.2.fastq
upper-bound coverage depth              80
lower-bound coverage depth              5
k-mer length                            25
expected no. distinct k-mers            2277975
no. hash. function(s)                   1
expected false positive proba.          0.000000 (cov. depth > 1)
no. buckets for hash. 1                 4283781797
no. bits per bucket                     8
count-min sketch size (Gb)              4
no. input reads                         571792
no. input k-mers                        140851747
no. unset buckets for hash. 1           4282925434
approx. no. distinct k-mers             856450
approx. false positive proba.           0.000000
no. selected reads                      15804
PE output files                         ERR7756257.norm.R1.fastq ERR7756257.norm.R2.fastq

The output FASTQ files contain 15,804 HTS PE reads (2,319,355 and 2,319,367 bases, respectively)

References

Brown CT, Howe A, Zhang Q, Pyrkosz AB, Brom YH (2012) A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. arXiv:1203.4802v2.

Chen TW, Gan RC, Chang YF, Liao W-C, Wu TH, Lee C-C, Huang P-J, Lee C-Y, Chen Y-YM, Chiu CH, Tang P (2015) Is the whole greater than the sum of its parts? De novo assembly strategies for bacterial genomes based on paired-end sequencing. BMC Genomics, 16:648. doi:10.1186/s12864-015-1859-8.

Cormode G, Muthukrishnan S (2005) An Improved Data Stream Summary: The Count-Min Sketch and its Applications. Journal of Algorithms, 55:29-38. doi:10.1016/j.jalgor.2003.12.001.

Deorowicz S, Debudaj-Grabysz A, Grabowski S (2013) Disk-based k-mer counting on a PC. BMC Bioinformatics, 14:160. doi:10.1186/1471-2105-14-160.

Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A (2015) KMC 2: Fast and resource-frugal k-mer counting. Bioinformatics, 31(10):1569-1576. doi:10.1093/bioinformatics/btv022.

Desai A, Marwah VS, Yadav A, Jha V, Dhaygude K, Bangar U, Kulkarni V, Jere A (2013) Identification of Optimum Sequencing Depth Especially for De Novo Genome Assembly of Small Genomes Using Next Generation Sequencing Data. PLoS ONE, 8(4):e60204. doi:10.1371/journal.pone.0060204.

Durai DA, Schulz MH (2019) Improving in-silico normalization using read weights. Scientific Reports, 9:5133. doi:10.1038/s41598-019-41502-9.

Kokot M, Długosz M, Deorowicz S (2017) KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759-2761. doi:10.1093/bioinformatics/btx304.

Legrand V, Kergrohen T, Joly N, Criscuolo A (2022) ROCK: digital normalization of whole genome sequencing data. Journal of Open Source Software, 7(73):3790. doi:10.21105/joss.03790.

Liu B, Shi Y, Yuan J, Hu X, Zhang H, Li N, Li Z, Chen Y, Mu D, Fan W (2013) Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv:1308.2012v2.

Melsted P, Halldórsson BV (2014) KmerStream: streaming algorithms for k-mer abundance estimation. Bioinformatics, 30(24):3541-3547. doi:10.1093/bioinformatics/btu713.

Mohamadi H, Khan H, Birol I (2017) ntCard: a streaming algorithm for cardinality estimation in genomics data. Bioinformatics, 33(9):1324-1330. doi:10.1093/bioinformatics/btw832.

Wedemeyer A, Kliemann L, Srivastav A, Schielke C, Reusch TB, Rosenstiel P (2017) An improved filtering algorithm for big read datasets and its application to single-cell assembly. BMC Bioinformatics, 18:324. doi:10.1186/s12859-017-1724-7.

Contributions

Suggestions for improving ROCK or for adding new functionalities, as well as merge requests, are welcome.