fastq_info is a command line program written in Bash/AWK for quickly estimating several standard descriptive statistics from FASTQ-formatted High-Throughput Sequencing (HTS) files. Estimated statistics per FASTQ file are:
▹ HTS read and base numbers,
▹ HTS read length distribution,
▹ GC-content per HTS read position,
▹ Phred score distribution (global and for each HTS read position),
▹ average Phred score per HTS read distribution.
Several output result formats are available (e.g. reduced/full table, tab-delimited).
Clone this repository with the following command line:
git clone https://gitlab.pasteur.fr/GIPhy/fastq_info.git
Go to the directory fastq_info/ and give the execute permission to the file fastq_info.sh:
chmod +x fastq_info.sh
and run it with the following command line model:
./fastq_info.sh [options]
fastq_info requires an AWK interpreter in the $PATH, which is the case for most Linux distributions. By default, fastq_info first looks for gawk (GNU awk, generally available on recent Linux distributions); if gawk is not found, the basic awk command in the $PATH is used. Alternative implementations of AWK can be specified using option -a (see Usage section). In practice, fastq_info is able to detect and deal with most AWK interpreters (e.g. nawk, mawk, goawk).
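The default lookup order described above (gawk first, then awk) can be illustrated with a short Bash sketch. The function name pick_awk and the message format are hypothetical; this is not the code used inside fastq_info.sh, only one plausible way to express the same fallback.

```shell
# Hypothetical helper: print the first AWK interpreter found in $PATH,
# trying gawk first and falling back to the basic awk command.
pick_awk() {
  for candidate in gawk awk; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no AWK interpreter found in \$PATH" >&2
  return 1
}

# Show which interpreter would be selected on this machine.
AWK_BIN=$(pick_awk) && echo "AWK interpreter: $AWK_BIN"
```

Any other interpreter (or a full path to a binary) would be passed to fastq_info with option -a instead of relying on this lookup.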
To use fastq_info with standard FASTQ compression formats, it is expected that the following binaries are available in the $PATH:
+ gzip, required to deal with files compressed using gzip;
+ bzip2, required to deal with files compressed using bzip2;
+ pigz, expected to deal with files compressed using gzip on multiple threads (when not installed, gzip is used instead);
+ pbzip2, expected to deal with files compressed using bzip2 on multiple threads (when not installed, bzip2 is used instead);
+ dsrc, required to deal with files compressed using DSRC 2.0 RC/RC2 (Roguski and Deorowicz 2014);
+ fqzcomp, required to deal with files compressed using fqzcomp 4.0 (Bonfield and Mahoney 2013);
+ quip, required to deal with files compressed using QUIP (Jones et al. 2012).
To run fastq_info, it is not required to install all these binaries, but the dedicated tool(s) should be available depending on the compression format of the input files.
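Before processing compressed files, it can be handy to see which of these binaries are reachable. The sketch below is independent of fastq_info (whose own option -c reports this information); the function name report_tool and the found/missing output format are assumptions made for illustration.

```shell
# Hypothetical helper: report whether a binary is reachable through $PATH.
report_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    printf '%s\tfound\n' "$1"
  else
    printf '%s\tmissing\n' "$1"
  fi
}

# Binaries listed above; only those matching the input compression formats are needed.
for tool in gzip bzip2 pigz pbzip2 dsrc fqzcomp quip; do
  report_tool "$tool"
done
```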
Run fastq_info without any option to read the following documentation:
USAGE: fastq_info.sh [options] [<file1> <file2> ...]
Allowed file extensions (case insensitive):
.bz
.bz2
.bzip
.bzip2 ... considered as FASTQ-formatted files compressed using bzip2;
decompressed using bunzip2 or pbzip2 (when available in
$PATH)
.dsrc
.dsrc2 ... considered as FASTQ-formatted files compressed using DSRC
v2.0 (sun.aei.polsl.pl/dsrc); decompressed using DSRC v2.0
(when available in $PATH)
.fastq
.fq
.txt ..... considered as uncompressed FASTQ-formatted files
.fqz ..... considered as FASTQ-formatted files compressed using
fqzcomp v4 (github.com/jkbonfield/fqzcomp); decompressed
using fqzcomp v4 (when available in $PATH)
.gz
.gzip .... considered as FASTQ-formatted files compressed using gzip;
decompressed using gunzip or pigz (when available in $PATH)
.qp ...... considered as FASTQ-formatted files compressed using QUIP
(github.com/dcjones/quip); decompressed using QUIP (when
available in $PATH)
Options:
-s <int> speed index between 1 (slower) and 9 (faster) to manage the
subsampling rate; when set to 1, 2, 3, 4 or 5, results are
based on 100%, ~33%, ~20%, ~15% or ~10% of all FASTQ blocks
(default: 5)
-v <char> reduced (r), full (f) or tab-delimited (t) result output
(default: r)
-p <int> Phred quality offset (default: 33)
-d DOS end-of-lines in input file(s) (default: not set)
-t <int> number of thread(s) for decompressing files (default: 1)
-a AWK interpreter (default: gawk or awk in $PATH)
-c checks available tools (default: not set)
-h prints this help and exits
Each input file is decompressed (if required) and parsed. Numbers of High-Throughput Sequencing (HTS) reads and bases are exact, as well as the derived average HTS read length. All other descriptive statistics are estimated by an AWK program based on FASTQ block subsampling (except when setting option -s 1). A low subsampling rate (e.g. ~10% by default) is generally sufficient to obtain results representative of the whole set of HTS reads.
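The split between exact counts and subsampled estimates can be sketched in a few lines of AWK. This is a simplified illustration, not the program shipped with fastq_info: it assumes strict 4-line FASTQ blocks, samples one block out of every step blocks (a regular stride, which may differ from the actual sampling scheme), and only estimates the overall GC-content. The function name fastq_sketch is hypothetical.

```shell
# Sketch only: exact read/base counts, GC-content estimated from 1 block out of STEP.
# Usage: fastq_sketch <uncompressed FASTQ file> [STEP]   (STEP=10 gives a ~10% rate)
fastq_sketch() {
  awk -v step="${2:-10}" '
    NR % 4 == 2 {                      # sequence line of each 4-line FASTQ block
      reads++                          # exact counts use every block
      bases += length($0)
      if ((reads - 1) % step == 0) {   # subsample: blocks 1, 1+step, 1+2*step, ...
        seq = $0
        sampled += length(seq)
        gc += gsub(/[GCgc]/, "", seq)  # gsub returns the number of replaced characters
      }
    }
    END {
      printf "reads=%d bases=%d avg_len=%.1f", reads, bases, (reads ? bases / reads : 0)
      printf " gc=%.2f\n", (sampled ? 100 * gc / sampled : 0)
    }' "$1"
}
```

With STEP set to 1 every block contributes to the GC estimate, which mirrors the intent of option -s 1; larger values trade accuracy for speed.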
fastq_info is able to process many input files specified using filename expansion, e.g. dirname/*.fastq.gz.
In the output, every empty entry is indicated by a dot instead of zero.
The tab-delimited option -v t outputs only a few statistics: numbers of HTS reads and bases (NR and NB, respectively), average HTS read length (AL), the three quartiles of the global Phred score distribution (BQ1, BQ2, BQ3) and the three quartiles of the average Phred score per HTS read distribution (RQ1, RQ2, RQ3). For detailed distributions per HTS read position and/or Phred score value, use options -v r or -v f.
Specific AWK interpreters can be used via the option -a (either a name within the $PATH or the full path to a binary). fastq_info was successfully run together with gawk, nawk, mawk, and goawk. However, faster running times were generally observed using gawk versions ≥ 4.0 (on Linux).
Option -c can be useful to obtain a checklist of the required/expected binaries available in the $PATH, as well as their respective versions (especially for the AWK interpreter).
Option -d can be useful when dealing with FASTQ files containing non-Unix end-of-lines (e.g. created under Microsoft Windows).
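To see why DOS end-of-lines matter, note that each line then ends with an extra carriage return, which inflates every sequence length by one if it is not removed. The sketch below shows the effect on a single FASTQ block; strip_cr is a hypothetical helper, not the way option -d is implemented.

```shell
# Hypothetical helper: remove the trailing carriage return of every input line.
strip_cr() {
  awk '{ sub(/\r$/, ""); print }'
}

# Length of the sequence line (2nd line of the block) once carriage returns are removed.
printf '@r1\r\nACGT\r\n+\r\nIIII\r\n' | strip_cr | awk 'NR == 2 { print length($0) }'
```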
The following Bash command line downloads the pair of gzipped FASTQ files SRR001666_1.fastq.gz and SRR001666_2.fastq.gz to be used as examples:
wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666*.fastq.gz
The following command line runs fastq_info.sh to analyze the second (i.e. R2) downloaded file:
./fastq_info.sh SRR001666_2.fastq.gz
leading to the following standard output:
##File: SRR001666_2.fastq.gz
#no.reads: 7047668
#no.bases: 253716048
#avg.lgt: 36.0
------------------------------
pos Lfreq GCfreq Q1 Q2 Q3
----- ------ ------ -- -- --
1 . 47.61 40 40 40
2 . 47.18 40 40 40
3 . 49.13 40 40 40
4 . 50.43 40 40 40
5 . 49.95 40 40 40
6 . 51.04 40 40 40
7 . 50.40 40 40 40
8 . 50.22 40 40 40
9 . 50.63 40 40 40
10 . 50.74 40 40 40
11 . 50.72 40 40 40
12 . 51.04 39 40 40
13 . 50.77 35 40 40
14 . 50.69 33 40 40
15 . 51.01 31 40 40
16 . 51.08 28 40 40
17 . 51.07 25 40 40
18 . 51.44 24 40 40
19 . 51.40 22 39 40
20 . 51.25 20 35 40
21 . 51.52 19 33 40
22 . 51.46 18 31 40
23 . 51.24 16 28 40
24 . 51.53 15 26 40
25 . 51.55 13 24 40
26 . 51.46 12 22 38
27 . 51.75 12 21 36
28 . 51.72 11 19 33
29 . 51.45 10 18 32
30 . 51.85 10 17 29
31 . 51.43 9 16 28
32 . 51.36 9 15 26
33 . 51.74 8 14 24
34 . 51.73 7 13 23
35 . 51.63 7 12 22
36 100.00 51.70 6 12 22
----- ------ ------ -- -- --
Q1 Q2 Q3
-- -- --
bases 18 40 40
reads 26 30 34
------------------------------
The first part of the output table is made up of one row per HTS read position (column pos). For each pos value (varying from 1 to the largest observed HTS read length), the corresponding row indicates the percentage of HTS reads of length equal to pos (column Lfreq), the percentage of observed GC bases (column GCfreq), and the 1st, 2nd and 3rd quartiles of observed Phred scores (columns Q1, Q2 and Q3, respectively). The bottom part of the table summarizes the global Phred score distribution (row bases: three quartiles Q1, Q2 and Q3), and the average Phred score per HTS read distribution (last row reads: three quartiles Q1, Q2 and Q3).
The above example therefore shows that the majority of Phred scores fall below Q = 20 at positions 28-36 (i.e. the median Phred score Q2 is lower than 20 from HTS read position 28 onwards). At least 25% of all sequenced bases are associated with Phred scores < 19 (i.e. first quartile Q1 = 18 in row bases), but at least 50% of the HTS reads have an average Phred score > 29 (median Q2 = 30 in row reads).
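To put these thresholds in perspective, a Phred score Q corresponds to an estimated base-calling error probability p = 10^(-Q/10) (standard Phred definition). The short sketch below converts a few of the scores discussed above; the function name phred_to_prob is hypothetical and not part of fastq_info.

```shell
# Convert a Phred score Q into an error probability p = 10^(-Q/10).
phred_to_prob() {
  awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

# A few scores appearing in the example table.
for q in 10 18 20 30 40; do
  printf 'Q=%s\tp=%s\n' "$q" "$(phred_to_prob "$q")"
done
```

For instance, Q = 20 corresponds to one expected error per 100 base calls, and Q = 30 to one per 1000.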
For more details (i.e. one supplementary column for each observed Phred score), a full table can be output using option -v f:
./fastq_info.sh -v f SRR001666_2.fastq.gz
##File: SRR001666_2.fastq.gz
#no.reads: 7047668
#no.bases: 253716048
#avg.lgt: 36.0
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pos Lfreq GCfreq Q1 Q2 Q3 Q= 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
----- ------ ------ -- -- -- ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1 . 47.61 40 40 40 0.1 0.0 0.0 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 94.3
2 . 47.18 40 40 40 0.1 0.0 0.0 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 94.2
3 . 49.13 40 40 40 0.2 0.1 0.0 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 92.3
4 . 50.43 40 40 40 0.2 0.1 0.0 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 92.3
5 . 49.95 40 40 40 0.2 0.1 0.0 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.3 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 91.5
6 . 51.04 40 40 40 0.2 0.1 0.0 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.3 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.3 0.4 0.4 0.4 0.4 0.4 90.6
7 . 50.40 40 40 40 0.2 0.1 0.0 0.1 0.2 0.2 0.1 0.1 0.1 0.2 0.3 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 87.4
8 . 50.22 40 40 40 0.2 0.1 0.0 0.1 0.2 0.2 0.1 0.1 0.1 0.2 0.4 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 86.8
9 . 50.63 40 40 40 0.2 0.1 0.0 0.1 0.2 0.3 0.2 0.2 0.2 0.2 0.5 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.7 0.7 0.7 82.7
10 . 50.74 40 40 40 0.2 0.1 0.1 0.1 0.3 0.3 0.2 0.2 0.2 0.2 0.6 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.8 0.8 80.6
11 . 50.72 40 40 40 0.3 0.1 0.1 0.2 0.3 0.3 0.2 0.2 0.2 0.3 0.6 0.3 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 78.6
12 . 51.04 39 40 40 0.2 0.1 0.1 0.2 0.3 0.4 0.2 0.3 0.3 0.3 0.7 0.4 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.8 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 1.0 0.9 1.0 1.0 74.4
13 . 50.77 35 40 40 0.2 0.2 0.1 0.2 0.3 0.4 0.2 0.3 0.3 0.4 0.8 0.5 0.5 0.6 0.6 0.7 0.7 0.8 0.8 0.8 0.8 0.9 0.9 0.9 1.0 1.0 1.0 1.0 1.0 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 70.1
14 . 50.69 33 40 40 0.2 0.2 0.1 0.2 0.4 0.5 0.3 0.3 0.3 0.4 0.9 0.5 0.6 0.6 0.7 0.7 0.8 0.8 0.8 0.9 0.9 1.0 1.0 1.0 1.0 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.2 1.1 1.2 1.1 1.2 1.1 1.1 1.1 67.8
15 . 51.01 31 40 40 0.2 0.2 0.1 0.2 0.4 0.5 0.3 0.3 0.4 0.5 1.1 0.6 0.7 0.7 0.8 0.8 0.9 0.9 1.0 1.0 1.0 1.1 1.1 1.1 1.1 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 64.9
16 . 51.08 28 40 40 0.2 0.3 0.1 0.3 0.5 0.6 0.4 0.5 0.5 0.6 1.3 0.8 0.8 0.9 0.9 1.0 1.1 1.1 1.1 1.2 1.2 1.2 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.4 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.2 1.2 59.9
17 . 51.07 25 40 40 0.2 0.3 0.2 0.3 0.6 0.8 0.5 0.6 0.6 0.7 1.6 0.9 1.0 1.1 1.1 1.2 1.2 1.3 1.3 1.4 1.4 1.4 1.4 1.5 1.5 1.5 1.4 1.5 1.5 1.5 1.4 1.5 1.4 1.4 1.4 1.4 1.4 1.3 1.3 1.3 54.9
18 . 51.44 24 40 40 0.4 0.3 0.2 0.4 0.6 0.8 0.5 0.6 0.7 0.7 1.7 1.0 1.1 1.1 1.2 1.3 1.3 1.4 1.4 1.5 1.5 1.5 1.6 1.5 1.5 1.6 1.6 1.6 1.5 1.6 1.6 1.5 1.5 1.5 1.4 1.4 1.4 1.4 1.3 1.3 52.2
19 . 51.40 22 39 40 0.2 0.3 0.2 0.4 0.7 0.9 0.6 0.7 0.7 0.9 2.0 1.1 1.2 1.3 1.4 1.4 1.5 1.5 1.6 1.6 1.6 1.7 1.6 1.7 1.6 1.7 1.7 1.6 1.6 1.6 1.6 1.6 1.5 1.5 1.5 1.4 1.4 1.4 1.3 1.3 48.8
20 . 51.25 20 35 40 0.2 0.4 0.2 0.5 0.9 1.1 0.7 0.8 0.9 1.0 2.4 1.3 1.4 1.5 1.6 1.6 1.7 1.7 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.7 1.7 1.6 1.6 1.6 1.5 1.5 1.4 1.4 1.4 1.3 1.3 43.9
21 . 51.52 19 33 40 0.3 0.4 0.3 0.6 1.0 1.3 0.8 0.9 1.0 1.1 2.6 1.5 1.6 1.6 1.8 1.8 1.8 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.8 1.8 1.8 1.8 1.7 1.7 1.6 1.6 1.5 1.5 1.4 1.4 1.4 1.3 1.3 40.8
22 . 51.46 18 31 40 0.2 0.5 0.3 0.7 1.1 1.4 0.9 1.0 1.2 1.3 3.0 1.7 1.7 1.8 1.9 2.0 2.0 2.0 2.1 2.0 2.1 2.1 2.0 2.0 2.0 1.9 1.9 1.9 1.8 1.8 1.7 1.6 1.6 1.5 1.5 1.4 1.4 1.3 1.3 1.2 37.3
23 . 51.24 16 28 40 0.3 0.6 0.4 0.8 1.2 1.7 1.0 1.2 1.3 1.5 3.4 1.9 2.0 2.1 2.1 2.2 2.2 2.3 2.3 2.2 2.2 2.2 2.1 2.1 2.1 2.0 1.9 1.8 1.8 1.7 1.7 1.6 1.6 1.5 1.4 1.4 1.3 1.3 1.2 1.2 33.3
24 . 51.53 15 26 40 0.3 0.7 0.4 0.9 1.4 1.9 1.2 1.4 1.6 1.7 3.9 2.1 2.2 2.3 2.4 2.4 2.4 2.4 2.4 2.4 2.3 2.3 2.2 2.2 2.1 2.1 1.9 1.9 1.8 1.7 1.6 1.6 1.5 1.5 1.4 1.3 1.2 1.2 1.2 1.1 29.5
25 . 51.55 13 24 40 0.2 0.8 0.5 1.0 1.7 2.3 1.4 1.6 1.8 2.0 4.4 2.4 2.5 2.6 2.6 2.6 2.6 2.6 2.5 2.5 2.4 2.3 2.2 2.2 2.1 2.0 1.9 1.8 1.7 1.7 1.6 1.5 1.4 1.4 1.3 1.2 1.2 1.1 1.1 1.0 26.0
26 . 51.46 12 22 38 0.2 1.0 0.6 1.2 2.0 2.6 1.6 1.8 2.0 2.2 4.9 2.6 2.7 2.7 2.7 2.7 2.7 2.7 2.6 2.5 2.4 2.4 2.2 2.2 2.1 2.0 1.9 1.8 1.7 1.6 1.5 1.5 1.4 1.3 1.2 1.2 1.1 1.1 1.0 1.0 23.4
27 . 51.75 12 21 36 0.2 1.2 0.7 1.4 2.2 2.9 1.8 2.0 2.2 2.4 5.2 2.8 2.8 2.9 2.8 2.8 2.8 2.7 2.7 2.6 2.4 2.3 2.3 2.1 2.0 2.0 1.8 1.7 1.6 1.6 1.5 1.4 1.3 1.2 1.2 1.1 1.1 1.0 1.0 0.9 21.3
28 . 51.72 11 19 33 0.2 1.3 0.8 1.6 2.5 3.3 2.0 2.2 2.4 2.6 5.7 3.0 3.0 3.0 3.0 2.9 2.8 2.7 2.7 2.5 2.4 2.3 2.2 2.1 2.0 1.9 1.8 1.7 1.6 1.5 1.4 1.3 1.2 1.2 1.1 1.0 1.0 0.9 0.9 0.8 19.1
29 . 51.45 10 18 32 0.4 1.6 0.9 1.8 2.8 3.6 2.2 2.4 2.6 2.8 6.0 3.1 3.1 3.1 3.1 2.9 2.9 2.8 2.7 2.6 2.4 2.3 2.2 2.1 1.9 1.8 1.7 1.6 1.5 1.4 1.3 1.3 1.2 1.1 1.0 1.0 0.9 0.9 0.8 0.8 17.3
30 . 51.85 10 17 29 0.3 1.8 1.1 2.1 3.2 4.1 2.5 2.7 2.8 3.0 6.4 3.3 3.3 3.2 3.1 3.0 2.9 2.8 2.7 2.5 2.4 2.2 2.1 2.0 1.8 1.8 1.6 1.5 1.4 1.4 1.2 1.2 1.1 1.0 1.0 0.9 0.8 0.8 0.7 0.7 15.4
31 . 51.43 9 16 28 0.2 2.1 1.3 2.4 3.5 4.6 2.6 2.9 3.1 3.2 6.8 3.4 3.4 3.3 3.2 3.1 2.9 2.8 2.6 2.5 2.3 2.2 2.0 1.9 1.8 1.7 1.5 1.4 1.4 1.3 1.2 1.1 1.0 1.0 0.9 0.8 0.8 0.7 0.7 0.6 13.8
32 . 51.36 9 15 26 0.3 2.5 1.4 2.6 3.9 4.9 2.9 3.1 3.4 3.4 7.1 3.5 3.4 3.3 3.2 3.1 2.9 2.8 2.6 2.4 2.2 2.1 2.0 1.8 1.7 1.6 1.4 1.4 1.3 1.2 1.1 1.0 1.0 0.9 0.8 0.8 0.7 0.7 0.6 0.6 12.7
33 . 51.74 8 14 24 0.2 2.9 1.7 3.0 4.3 5.5 3.1 3.4 3.6 3.7 7.3 3.6 3.5 3.4 3.2 3.0 2.8 2.7 2.5 2.3 2.2 2.0 1.9 1.7 1.6 1.5 1.4 1.3 1.2 1.1 1.0 0.9 0.9 0.8 0.7 0.7 0.7 0.6 0.6 0.5 11.1
34 . 51.73 7 13 23 0.2 3.3 1.9 3.3 4.8 5.9 3.4 3.6 3.8 3.9 7.7 3.7 3.6 3.4 3.2 3.0 2.8 2.6 2.4 2.2 2.1 1.9 1.7 1.6 1.5 1.4 1.2 1.2 1.1 1.0 0.9 0.9 0.8 0.7 0.7 0.7 0.6 0.6 0.5 0.5 9.7
35 . 51.63 7 12 22 0.3 3.9 2.1 3.7 5.2 6.4 3.5 3.8 3.9 4.0 7.8 3.7 3.5 3.4 3.2 3.0 2.7 2.5 2.3 2.1 2.0 1.8 1.7 1.5 1.4 1.3 1.2 1.1 1.0 1.0 0.9 0.8 0.7 0.7 0.6 0.6 0.6 0.5 0.5 0.5 8.8
36 100.00 51.70 6 12 22 0.2 4.1 2.4 4.0 5.5 6.7 3.6 3.8 3.9 3.9 7.6 3.6 3.5 3.3 3.0 2.8 2.6 2.4 2.2 2.1 1.9 1.8 1.6 1.5 1.4 1.3 1.2 1.1 1.0 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.5 0.4 9.0
----- ------ ------ -- -- -- ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
Q1 Q2 Q3 Q= 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
-- -- -- ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
bases 18 40 40 0.2 0.9 0.5 1.0 1.5 1.9 1.1 1.2 1.3 1.4 2.9 1.5 1.6 1.6 1.6 1.5 1.5 1.5 1.5 1.4 1.4 1.4 1.3 1.3 1.2 1.2 1.2 1.1 1.1 1.1 1.0 1.0 1.0 1.0 0.9 0.9 0.9 0.9 0.8 0.8 51.0
reads 26 30 34 0.1 0.0 0.0 0.0 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.4 0.4 0.5 0.6 0.8 1.0 1.3 1.6 2.2 2.7 3.5 4.2 5.1 5.8 6.5 7.0 7.3 7.3 7.1 6.7 6.2 5.5 4.8 3.9 2.9 1.9 0.7
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#subsampling.rate: 0.111
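The quartile columns can be related to the per-score percentages of the full table: a quartile is a score at which the cumulative frequency reaches 25%, 50% or 75%. The sketch below uses one common convention (the smallest score whose cumulative share reaches the level); the exact rule applied by fastq_info is not stated here, so this is an assumption, and hist_quartiles is a hypothetical name. Input is a two-column tab-delimited histogram (score, frequency) sorted by increasing score.

```shell
# Read "score<TAB>frequency" lines sorted by increasing score and print the
# smallest scores whose cumulative frequency reaches 25%, 50% and 75% of the total.
hist_quartiles() {
  awk -F '\t' '
    { score[NR] = $1; freq[NR] = $2; total += $2 }
    END {
      split("0.25 0.50 0.75", level, " ")
      k = 1; cum = 0
      for (i = 1; i <= NR && k <= 3; i++) {
        cum += freq[i]
        # one score may satisfy several quartile levels at once
        while (k <= 3 && cum >= level[k] * total) { q[k] = score[i]; k++ }
      }
      printf "Q1=%s Q2=%s Q3=%s\n", q[1], q[2], q[3]
    }'
}

# Toy histogram: four scores with equal frequencies.
printf '10\t1\n20\t1\n30\t1\n40\t1\n' | hist_quartiles
```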
To help with reading, the main statistics for all files can be summarized in tab-delimited format using option -v t:
./fastq_info.sh -v t SRR001666*.fastq.gz
#File NR NB AL BQ1 BQ2 BQ3 RQ1 RQ2 RQ3
SRR001666_1.fastq.gz 7047668 253716048 36.0 30 40 40 32 35 37
SRR001666_2.fastq.gz 7047668 253716048 36.0 18 40 40 26 30 34
This simple output format makes it easy to read every file name (#File), the numbers of HTS reads (NR) and bases (NB), the average HTS read length (AL), as well as the three quartiles of the global Phred score distribution (BQ1, BQ2, BQ3) and of the average Phred score per HTS read distribution (RQ1, RQ2, RQ3).
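Because this format is tab-delimited, it can be post-processed with standard tools. As an illustration, the sketch below lists the files whose BQ1 value is below a cutoff. It assumes the column order shown above (BQ1 in the 5th column) and a header line starting with '#'; the function name flag_low_bq1 and the default cutoff of 20 are arbitrary choices.

```shell
# Print the name of every file whose BQ1 (5th tab-delimited column) is below the cutoff.
# Usage: ./fastq_info.sh -v t *.fastq.gz | flag_low_bq1 20
flag_low_bq1() {
  awk -F '\t' -v cutoff="${1:-20}" '!/^#/ && $5 + 0 < cutoff + 0 { print $1 }'
}
```

Applied to the two example rows above with a cutoff of 20, only SRR001666_2.fastq.gz (BQ1 = 18) would be reported.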
The above example clearly shows that the overall sequencing error rate is lower in file SRR001666_1.fastq.gz than in file SRR001666_2.fastq.gz.
By default, all distributions are estimated from a subset of all input FASTQ blocks to obtain fast running times (the subsampling rate is indicated when using option -v f). In almost all cases, the default subsampling rate (i.e. ~10% with option -s 5) is sufficient to approximate the different distributions (i.e. HTS read lengths, GC-content, Phred scores).
For example, the command line below uses all FASTQ blocks from each input file (i.e. option -s 1):
./fastq_info.sh -s 1 -v t SRR001666*.fastq.gz
#File NR NB AL BQ1 BQ2 BQ3 RQ1 RQ2 RQ3
SRR001666_1.fastq.gz 7047668 253716048 36.0 30 40 40 32 35 37
SRR001666_2.fastq.gz 7047668 253716048 36.0 18 40 40 26 30 34
All statistics are identical to the ones previously estimated (see above), but the overall running time was 8 times longer. For comparison, when used with default options, fastq_info is expected to run ~1.5 times faster than FastQC to process one FASTQ file.
Bonfield JK, Mahoney MV (2013) Compression of FASTQ and SAM format sequencing data. PLOS One, 8(3):e59190. doi:10.1371/journal.pone.0059190.
Jones DC, Ruzzo WL, Peng X, Katze MG (2012) Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Research, 40(22):e171–e171. doi:10.1093/nar/gks754.
Roguski L, Deorowicz S (2014) DSRC 2 - Industry-oriented compression of FASTQ files. Bioinformatics, 30(15):2213-2215. doi:10.1093/bioinformatics/btu208.