ROCK

ROCK (Reducing Over-Covering K-mers) is a command line program written in C++ that runs an alternative implementation of the digital normalization method (e.g. Brown et al. 2012, Wedemeyer et al. 2017, Durai and Schulz 2019).

Given one or several FASTQ file(s), the main aim of ROCK is to build a subset of accurate high-throughput sequencing (HTS) reads such that the induced coverage depth is comprised between two specified lower- and upper-bounds. ROCK can therefore be used to reduce and/or homogenize the overall coverage depth within large sets of HTS reads, which is often required to quickly infer accurate de novo genome assemblies (e.g. Desai et al. 2013, Chen et al. 2015). ROCK can also be used to discard low-covering HTS reads, as those ones are often artefactual, highly erroneous or contaminating sequences.

Usage

Run ROCK without option to read the following documentation:

Reducing Over-Covering K-mers within FASTQ file(s)

USAGE: rock [options] [files]

OPTIONS:
 -i <file>  file containing the name(s)  of the input FASTQ file(s) to
            process;  single-end: one file name per line;  paired-end:
            two file names  per line  separated by  a comma;  up to 15
            FASTQ file  names can  be specified;  of note,  input file
            name(s) can also be specified as program argument(s)
 -o <file>  file containing the  name(s) of the  output FASTQ file(s);
            FASTQ file name(s) should be structured in the same way as
            the file specified in option -i.
 -k <int>   k-mer length (default 25)
 -c <int>   lower-bound k-mer coverage depth threshold (default: 0)
 -C <int>   upper-bound k-mer coverage depth threshold (default: 70)
 -l <int>   number of hashing function(s) (default: 4)
 -n <int>   expected total number of  distinct k-mers within the input
            read sequences; not compatible with option -l.
 -f <float> maximum expected false positive probability when computing
            the optimal number of hashing functions from the number of
            distinct k-mers specified with option -n (default: 0.05).
 -q <int>   sets as valid  only k-mers  made  up  of  nucleotides with
            Phred score (+33 offset) above this cutoff (default: 0)
 -m <int>   minimum number of  valid k-mer(s) to  consider a read; all
            non-considered reads are  written into output file(s) with
            extension undefined.fastq (default: 1)
 -v         verbose mode
 -h         prints this message and exit

Notes

References

Brown CT, Howe A, Zhang Q, Pyrkosz AB, Brom YH (2012) A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. arXiv:1203.4802v2.

Chen TW, Gan RC, Chang YF, Liao W-C, Wu TH, Lee C-C, Huang P-J, Lee C-Y, Chen Y-YM, Chiu CH, Tang P (2015) Is the whole greater than the sum of its parts? De novo assembly strategies for bacterial genomes based on paired-end sequencing. BMC Genomics, 16:648. doi:10.1186/s12864-015-1859-8.

Cormode G, Muthukrishnan S (2005) An Improved Data Stream Summary: The Count-Min Sketch and its Applications. Journal of Algorithms, 55:29-38. doi:10.1016/j.jalgor.2003.12.001.

Deorowicz S, Debudaj-Grabysz A, Grabowski S (2013) Disk-based k-mer counting on a PC. BMC Bioinformatics, 14:160. doi:10.1186/1471-2105-14-160.

Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A (2015) KMC 2: Fast and resource-frugal k-mer counting. Bioinformatics, 31(10):1569-1576. doi:10.1093/bioinformatics/btv022.

Desai A, Marwah VS, Yadav A, Jha V, Dhaygude K, Bangar U, Kulkarni V, Jere A (2013) Identification of Optimum Sequencing Depth Especially for De Novo Genome Assembly of Small Genomes Using Next Generation Sequencing Data. PLoS ONE, 8(4):e60204. doi:10.1371/journal.pone.0060204.

Durai DA, Schulz MH (2019) Improving in-silico normalization using read weights. Scientific Reports, 9:5133. doi:10.1038/s41598-019-41502-9.

Kokot M, Długosz M, Deorowicz S (2017) KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759-2761. doi:10.1093/bioinformatics/btx304.

Liu B, Shi Y, Yuan J, Hu X, Zhang H, Li N, Li Z, Chen Y, Mu D, Fan W (2013) Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv:1308.2012v2.

Melsted P, Halldórsson BV (2014) KmerStream: streaming algorithms for k-mer abundance estimation. Bioinformatics, 30(24):3541-3547. doi:10.1093/bioinformatics/btu713.

Mohamadi H, Khan H, Birol I (2017) ntCard: a streaming algorithm for cardinality estimation in genomics data. Bioinformatics, 33(9):1324-1330. doi:10.1093/bioinformatics/btw832.

Wedemeyer A, Kliemann L, Srivastav A, Schielke C, Reusch TB, Rosenstiel P (2017) An improved filtering algorithm for big read datasets and its application to single-cell assembly. BMC Bioinformatics, 18:324. doi:10.1186/s12859-017-1724-7.