*ROCK* (Reducing Over-Covering K-mers) is a command line program written in C++ that runs an alternative implementation of the *digital normalization* method (e.g. Brown et al. 2012, Wedemeyer et al. 2017, Durai and Schulz 2019).

Given one or several FASTQ file(s), the main aim of *ROCK* is to build a subset of accurate high-throughput sequencing (HTS) reads such that the induced coverage depth is comprised between two specified lower- and upper-bounds. *ROCK* can therefore be used to reduce and/or homogenize the overall coverage depth within large sets of HTS reads, which is often required to quickly infer accurate *de novo* genome assemblies (e.g. Desai et al. 2013, Chen et al. 2015). *ROCK* can also be used to discard low-covering HTS reads, as those ones are often artefactual, highly erroneous or contaminating sequences.

Run *ROCK* without option to read the following documentation:

```
Reducing Over-Covering K-mers within FASTQ file(s)
USAGE: rock [options] [files]
OPTIONS:
-i <file> file containing the name(s) of the input FASTQ file(s) to
process; single-end: one file name per line; paired-end:
two file names per line separated by a comma; up to 15
FASTQ file names can be specified; of note, input file
name(s) can also be specified as program argument(s)
-o <file> file containing the name(s) of the output FASTQ file(s);
FASTQ file name(s) should be structured in the same way as
the file specified in option -i.
-k <int> k-mer length (default 25)
-c <int> lower-bound k-mer coverage depth threshold (default: 0)
-C <int> upper-bound k-mer coverage depth threshold (default: 70)
-l <int> number of hashing function(s) (default: 4)
-n <int> expected total number of distinct k-mers within the input
read sequences; not compatible with option -l.
-f <float> maximum expected false positive probability when computing
the optimal number of hashing functions from the number of
distinct k-mers specified with option -n (default: 0.05).
-q <int> sets as valid only k-mers made up of nucleotides with
Phred score (+33 offset) above this cutoff (default: 0)
-m <int> minimum number of valid k-mer(s) to consider a read; all
non-considered reads are written into output file(s) with
extension undefined.fastq (default: 1)
-v verbose mode
-h prints this message and exit
```

In brief, given an upper-bound

*k*-mer coverage depth cutoff*κ*(option`-C`

) and a lower-bound*k*-mer coverage cutoff*κ‘*(option`-c`

), the aim of*ROCK*is to select an HTS read subset*S*_{κ’,κ}such that the overall*k*-mer coverage depth induced by the members of*S*_{κ‘,κ}is expected to be always comprised between*κ’*and*κ*. When considering FASTQ files with high redundancy (i.e. coverage depth greater than*κ*),*ROCK*therefore returns smaller FASTQ files such that each of its HTS read corresponds to a genome region with*k*-mer coverage depth of at most*κ*. Setting*κ’*(option`-c`

) enables to discard HTS reads associated to a*k*-mer coverage depth lower than this lower-bound cutoff, which is often observed with artefactual, highly erroneous or contaminating HTS reads.After creating an empty count-min sketch (CMS; see below) to store the number of occurrences of every canonical

*k*-mer shared by the input HTS reads,*ROCK*proceeds in three main steps :- Sorting the input SE/PE HTS reads from the most to the less accurate ones (as defined by the sum of the Phred scores).
- For each sorted SE/PE HTS read(s), approximating its
*k*-mer coverage depth*c*_{k}(defined by the median of its*k*-mer occurence values, as returned by the CMS); if*c*_{k}≤*κ*, then adding the SE/PE HTS read(s) into the subset*S*_{κ’,κ}and updating the CMS for every corresponding canonical*k*-mer. - (when the lower-bound cutoff
*κ‘*> 0) For each SE/PE HTS read(s) in*S*_{κ’,κ}, (re)approximating its*k*-mer coverage depth*c*_{k}; if*c*_{k}≤*κ‘*, then removing the SE/PE HTS read(s) from*S*_{κ’,κ}.

At the end, all SE/PE HTS read(s) inside

*S*_{κ‘,κ}are written into output FASTQ file(s) (by default, file extension*.rock.fastq*). It is worth noticing that the step 1 ensures that all the HTS reads inside the returned subset*S*_{κ’,κ}are the most accurate ones.*ROCK*stores the number of occurences of every traversed canonical*k*-mer in a count-min sketch (CMS; e.g. Cormode and Muthukrishnan 2005), a dedicated probabilistic data structure with controllable false positive probability (FPP). By default,*ROCK*instantiates a CMS based on four hashing functions (option`-l`

), which can be sufficient for many cases, e.g. up to 10 billions canonical*k*-mers with*κ*>*κ‘*> 1 and FPP ≤ 5%. However, as each hashing function is defined on [0, 2^{32}[, the memory footprint of the CMS is 4*λ*Gb (when*κ*≤ 255, twice otherwise), where*λ*is the total number of hashing functions. It is therefore highly recommanded to provide the expected number*F*_{0}of canonical*k*-mers (option`-n`

) to enable*ROCK*to compute the optimal CMS dimension*λ*required to store this specified number of canonical*k*-mers with low FPP (option`-f`

), e.g.*λ*= 1 is sufficient to deal with up to 3 billions canonical*k*-mers when*κ*>*κ’*> 1 while ensuring FPP ≤ 5%. Moreover, a CMS based on few hashing functions entails faster running times. For instance, the programs*KMC*(Deorowicz et al. 2013, 2015; Kokot et al. 2017),*KmerStream*(Melsted and Halldórsson 2014) or*ntCard*(Mohamadi et al. 2017) can be used to quickly approximate this number (*F*_{0}).Of important note is that each of the upper- and lower-bound cutoffs (options

`-C`

and`-c`

, respectively) corresponds to a*k*-mer coverage depth value (denoted here*c*_{k}), which is quite different to the base coverage depth value (denoted here*c*_{b}). However, when*L*is the average input HTS read length,*c*_{b}/*c*_{k}and*L*/ (*L*-*k*+1) are expected to be identical for any fixed small*k*(e.g. Liu et al. 2013). In consequence, when an overall (base) coverage depth*c*_{b}is expected, one can therefore set*κ*=*c*_{b}(*L*-*k*+1) /*L*. For example, when dealing with HTS reads of length*L*= 144 (on average), an HTS read subset with expected base coverage depth*c*_{b}= 60x can be inferred by*ROCK*by setting*k*= 25 (option`-k`

) and*κ*= 60 (144-25+1) / 144 = 50 (option`-C`

).By default,

*ROCK*uses*k*-mers of length*k*= 25 (option`-k`

). Increasing this length is not recommanded when dealing with large FASTQ files (e.g. average coverage depth > 500x from genome size > 1 Gbps), as the total number of canonical*k*-mers can quickly grow, therefore implying a very large CMS (i.e. many hashing functions) to maintains low FPP (e.g. ≤ 0.05). Using small*k*-mers (e.g.*k*< 21) is also not recommanded, as this can negatively affect the overall specificity (i.e. too many identical*k*-mers arising from different sequenced genome region).All

*ROCK*steps are based on the usage of valid*k*-mers, i.e.*k*-mers that do not contain any undetermined base`N`

. Valid*k*-mers can also be determined by bases associated to a Phred score greater than a specified threshold (option`-q`

; Phred +33 offset, default: 0). A minimum number of valid*k*-mers can be specified to consider a SE/PE HTS read(s) (option`-m`

; default: 1). All SE/PE HTS read(s) that do not contain enough valid*k*-mers are written into FASTQ file(s) with extension*.undetermined.fastq*.

Brown CT, Howe A, Zhang Q, Pyrkosz AB, Brom YH (2012) *A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data*. **arXiv**:1203.4802v2.

Chen TW, Gan RC, Chang YF, Liao W-C, Wu TH, Lee C-C, Huang P-J, Lee C-Y, Chen Y-YM, Chiu CH, Tang P (2015) *Is the whole greater than the sum of its parts? De novo assembly strategies for bacterial genomes based on paired-end sequencing*. **BMC Genomics**, 16:648. doi:10.1186/s12864-015-1859-8.

Cormode G, Muthukrishnan S (2005) *An Improved Data Stream Summary: The Count-Min Sketch and its Applications*. **Journal of Algorithms**, 55:29-38. doi:10.1016/j.jalgor.2003.12.001.

Deorowicz S, Debudaj-Grabysz A, Grabowski S (2013) *Disk-based k-mer counting on a PC*. **BMC Bioinformatics**, 14:160. doi:10.1186/1471-2105-14-160.

Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A (2015) *KMC 2: Fast and resource-frugal k-mer counting*. **Bioinformatics**, 31(10):1569-1576. doi:10.1093/bioinformatics/btv022.

Desai A, Marwah VS, Yadav A, Jha V, Dhaygude K, Bangar U, Kulkarni V, Jere A (2013) *Identification of Optimum Sequencing Depth Especially for De Novo Genome Assembly of Small Genomes Using Next Generation Sequencing Data*. **PLoS ONE**, 8(4):e60204. doi:10.1371/journal.pone.0060204.

Durai DA, Schulz MH (2019) *Improving in-silico normalization using read weights*. **Scientific Reports**, 9:5133. doi:10.1038/s41598-019-41502-9.

Kokot M, Długosz M, Deorowicz S (2017) *KMC 3: counting and manipulating k-mer statistics*. **Bioinformatics**, 33(17):2759-2761. doi:10.1093/bioinformatics/btx304.

Liu B, Shi Y, Yuan J, Hu X, Zhang H, Li N, Li Z, Chen Y, Mu D, Fan W (2013) *Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects*. **arXiv**:1308.2012v2.

Melsted P, Halldórsson BV (2014) *KmerStream: streaming algorithms for k-mer abundance estimation*. **Bioinformatics**, 30(24):3541-3547. doi:10.1093/bioinformatics/btu713.

Mohamadi H, Khan H, Birol I (2017) *ntCard: a streaming algorithm for cardinality estimation in genomics data*. **Bioinformatics**, 33(9):1324-1330. doi:10.1093/bioinformatics/btw832.

Wedemeyer A, Kliemann L, Srivastav A, Schielke C, Reusch TB, Rosenstiel P (2017) *An improved filtering algorithm for big read datasets and its application to single-cell assembly*. **BMC Bioinformatics**, 18:324. doi:10.1186/s12859-017-1724-7.