fastq_info is a command line program written in Bash/AWK for quickly estimating several standard descriptive statistics from FASTQ-formatted High-Throughput Sequencing (HTS) read files. Estimated statistics per FASTQ file are:
▹ numbers of HTS reads and bases,
▹ distribution of HTS read lengths,
▹ nucleotide residue content per HTS read position,
▹ distribution of Phred scores (Ewing and Green 1998), and corresponding quartiles,
▹ distribution of Phred scores per HTS read position, and corresponding quartiles,
▹ distribution of the average Phred score per HTS read, and corresponding quartiles,
▹ distribution of the number of sequencing error(s) per HTS read (Edgar and Flyvbjerg 2015), and corresponding quartiles.
Different file compression formats can be handled (e.g. gzip, bzip2, DSRC 2.0). Several output result formats are available (e.g. reduced/full tables, tab-delimited).
Clone this repository with the following command line:
git clone https://gitlab.pasteur.fr/GIPhy/fastq_info.git
Go to the directory fastq_info/ and give the execute permission to the file fastq_info.sh:
cd fastq_info/
chmod +x fastq_info.sh
and run it with the following command line model:
./fastq_info.sh [options]
fastq_info requires an AWK interpreter in the $PATH, which is the case for most Linux distributions. By default, fastq_info first looks for gawk (GNU awk, generally available on recent Linux distributions); otherwise the basic command awk in the $PATH is used. Alternative AWK implementations can also be specified using option -a (see Usage section). In practice, because two different POSIX code variants are implemented (i.e. for the original One-True-AWK and for the extended GNU AWK), fastq_info is able to detect and deal with almost all AWK interpreters (e.g. nawk, mawk, goawk).
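The fallback order described above (gawk first, then the basic awk) can be pictured with a minimal sketch. The helper name first_available is hypothetical and is not taken from fastq_info.sh, whose actual detection logic may differ.

```shell
#!/bin/sh
# Hypothetical helper (not part of fastq_info.sh): print the first command
# name among the arguments that the shell can resolve (typically a binary
# found in $PATH).
first_available() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1   # none of the candidates is available
}

# Prefer gawk, fall back to the basic awk
AWK_BIN=$(first_available gawk awk) || echo "no AWK interpreter found" >&2
echo "selected AWK interpreter: ${AWK_BIN:-none}"
```

The selected name could then be passed explicitly to fastq_info through option -a, e.g. ./fastq_info.sh -a "$AWK_BIN" followed by the input files.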
To use fastq_info with standard FASTQ compression formats, it is expected that the following binaries are available in the $PATH:
+ gzip, required to deal with files compressed using gzip;
+ bzip2, required to deal with files compressed using bzip2;
+ pigz, expected to deal with files compressed using gzip on multiple threads (when not installed, gzip is used instead);
+ pbzip2, expected to deal with files compressed using bzip2 on multiple threads (when not installed, bzip2 is used instead);
+ dsrc, required to deal with files compressed using DSRC 2.0 RC/RC2 (Roguski and Deorowicz 2014);
+ fqzcomp, required to deal with files compressed using fqzcomp 4.0 (Bonfield and Mahoney 2013);
+ quip, required to deal with files compressed using QUIP (Jones et al. 2012).
To run fastq_info, it is not required to install all these binaries, but the dedicated tool(s) should be available depending on the compression format of the input files.
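Before processing a batch of files, it can be handy to see which of the tools listed above are installed. The following sketch is independent of fastq_info.sh (which provides its own check through option -c); the function name tool_status is an illustrative assumption.

```shell
#!/bin/sh
# Illustrative check, independent of fastq_info.sh: report whether a tool
# can be resolved by the shell (typically a binary found in $PATH).
tool_status() {
  if command -v "$1" >/dev/null 2>&1; then
    printf '%s\tfound\n' "$1"
  else
    printf '%s\tmissing\n' "$1"
  fi
}

# Tools named in the list above
for tool in gzip bzip2 pigz pbzip2 dsrc fqzcomp quip; do
  tool_status "$tool"
done
```

A tool reported as missing only matters when input files use the corresponding compression format.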
Run fastq_info without any option to read the following documentation:
USAGE: fastq_info.sh [options] [<file1> <file2> ...]
Allowed file extensions (case insensitive):
.bz
.bz2
.bzip
.bzip2 ... considered as FASTQ-formatted files compressed using bzip2;
decompressed using bunzip2 or pbzip2 (when available in
$PATH)
.dsrc
.dsrc2 ... considered as FASTQ-formatted files compressed using DSRC
v2.0 (sun.aei.polsl.pl/dsrc); decompressed using DSRC v2.0
(when available in $PATH)
.fastq
.fq ...... considered as uncompressed FASTQ-formatted files
.fqz ..... considered as FASTQ-formatted files compressed using
fqzcomp v4 (github.com/jkbonfield/fqzcomp); decompressed
using fqzcomp v4 (when available in $PATH)
.gz
.gzip .... considered as FASTQ-formatted files compressed using gzip;
decompressed using gunzip or pigz (when available in $PATH)
.qp ...... considered as FASTQ-formatted files compressed using QUIP
(github.com/dcjones/quip); decompressed using QUIP (when
available in $PATH)
Options:
-s <int> speed index between 1 (slower) and 9 (faster) to manage the
subsampling rate; when set to 1, 2, 3, 4 or 5, results are
based on 100%, ~33%, ~20%, ~15% or ~10% of all FASTQ blocks
(default: 5)
-v <char> reduced table (r), full table (f), or tab-delimited (t)
output (default: r)
-r residue content per position in output table (default: not
set)
-p <int> Phred quality offset (default: 33)
-d DOS end-of-lines in input file(s) (default: not set)
-t <int> number of thread(s) for decompressing files (default: 1)
-a <prog> AWK interpreter (default: gawk or awk in $PATH)
-c checks available tools (default: not set)
-h prints this help and exits
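The extension table in the usage text above amounts to a case-insensitive lookup from file extension to compression format. A minimal sketch of such a lookup follows; the function name guess_format and the returned labels are assumptions made for illustration, not code from fastq_info.sh.

```shell
#!/bin/sh
# Hypothetical helper: map a file name to the compression format implied
# by its extension, following the table in the usage text.
guess_format() {
  # lower-case the name so that matching is case insensitive
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    *.bz|*.bz2|*.bzip|*.bzip2) echo bzip2   ;;
    *.dsrc|*.dsrc2)            echo dsrc    ;;
    *.fastq|*.fq)              echo fastq   ;;
    *.fqz)                     echo fqzcomp ;;
    *.gz|*.gzip)               echo gzip    ;;
    *.qp)                      echo quip    ;;
    *)                         echo unknown ;;
  esac
}

guess_format reads_R1.FASTQ.GZ   # prints: gzip
guess_format reads.fq            # prints: fastq
guess_format reads.txt           # prints: unknown
```

An unknown result corresponds to the situation in which fastq_info displays a warning about the file extension.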
Each input file is decompressed (if required, depending on its extension) and parsed. Numbers of High-Throughput Sequencing (HTS) reads and bases are exact, as well as the derived average HTS read length. All other descriptive statistics are estimated by an AWK program based on FASTQ block subsampling (except when setting option -s 1). A low subsampling rate (e.g. ~10% by default) is generally sufficient to obtain results representative of the whole set of HTS reads. For very fast running times, use option -s 9 (i.e. subsampling rate ~6%), at the cost of reduced accuracy.
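The text does not state how FASTQ blocks are selected during subsampling. To make the idea concrete, the sketch below assumes the simplest possible scheme: each read occupies exactly four lines, and one block out of every k is kept. Both the scheme and the function name subsample_blocks are assumptions, not the fastq_info implementation.

```shell
#!/bin/sh
# Sketch only: keep one FASTQ block (4 lines) out of every k, reading from
# standard input. The selection rule is an assumption made for
# illustration, not the one implemented in fastq_info.sh.
subsample_blocks() {
  awk -v k="$1" '{ block = int((NR - 1) / 4)      # 0-based block index
                   if (block % k == 0) print }'
}

# Six toy reads; with k=3, blocks 0 and 3 are kept (~33% of the blocks)
for i in 1 2 3 4 5 6; do
  printf '@read%s\nACGT\n+\nIIII\n' "$i"
done | subsample_blocks 3
```

With this toy input, only the blocks of read1 and read4 are printed. Statistics computed on such a subset are estimates, whereas counts of reads and bases still require a pass over every block.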
fastq_info is able to process many input files specified using filename expansion, e.g. dirname/*.fastq.gz. A warning is displayed when an input file is not a regular file (e.g. a directory) or when its extension is unknown. All warning messages can be suppressed by ending the command line with 2>/dev/null.
In the output, every empty entry is indicated by a dot instead of zero.
The tab-delimited option -v t outputs only a few summary statistics: numbers of HTS reads and bases (NR and NB, respectively), average HTS read length (AL), the three quartiles of the distribution of the Phred scores (BQ1, BQ2, BQ3), the three quartiles of the distribution of the average Phred score per HTS read (RQ1, RQ2, RQ3) and the three quartiles of the distribution of the (expected) number of sequencing error(s) per HTS read (EQ1, EQ2, EQ3). For detailed distributions per HTS read position and/or Phred score value, use options -v r or -v f.
Specific AWK interpreters can be used via the option -a (to set either a name within the $PATH or the full path to a binary). fastq_info was successfully run together with gawk, nawk, mawk, and goawk. However, faster running times were generally observed using gawk versions ≥ 4.0 (mawk remains acceptable, but goawk is not recommended).
Option -c can be useful to obtain a checklist of the required/expected binaries available in the $PATH, as well as their respective versions (especially for the AWK interpreter).
Option -d can be useful when dealing with FASTQ files containing non-Unix end-of-lines (e.g. created under Microsoft Windows).
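Whether a file needs option -d can be checked beforehand. The sketch below tests whether the first line of an uncompressed text file ends with a carriage return; the function name has_dos_eol is hypothetical, and compressed files would have to be decompressed first.

```shell
#!/bin/sh
# Sketch: succeed when the first line of a plain-text file ends with a
# carriage return, i.e. when the file uses DOS (CRLF) end-of-lines.
has_dos_eol() {
  cr=$(printf '\r')
  # command substitution strips the trailing newline but keeps any \r
  first_line=$(head -n 1 "$1")
  case "$first_line" in
    *"$cr") return 0 ;;
    *)      return 1 ;;
  esac
}

# Toy file written with DOS end-of-lines
tmp=$(mktemp)
printf '@read1\r\nACGT\r\n+\r\nIIII\r\n' > "$tmp"
if has_dos_eol "$tmp"; then
  echo "DOS end-of-lines detected: consider option -d"
fi
rm -f "$tmp"
```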
The following Bash command line downloads the pair of gzipped FASTQ files SRR001666_1.fastq.gz and SRR001666_2.fastq.gz to be used as examples:
wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666*.fastq.gz
The following command line runs fastq_info.sh to analyze the second (i.e. R2) downloaded file:
fastq_info.sh SRR001666_2.fastq.gz
leading to the following standard output:
##File: SRR001666_2.fastq.gz
#no.reads(NR): 7047668
#no.bases(NB): 253716048
#avg.lgt(AL): 36.0
-----------------------------
n Lfreq Efreq Q1 Q2 Q3
----- ------ ------ -- -- --
0 . 47.17 . . .
1 . 26.28 40 40 40
2 . 12.42 40 40 40
3 . 5.97 40 40 40
4 . 2.98 40 40 40
5 . 1.64 40 40 40
6 . 1.00 40 40 40
7 . 0.63 40 40 40
8 . 0.45 40 40 40
9 . 0.32 40 40 40
10 . 0.25 40 40 40
11 . 0.17 40 40 40
12 . 0.13 39 40 40
13 . 0.11 35 40 40
14 . 0.09 33 40 40
15 . 0.06 31 40 40
16 . 0.05 28 40 40
17 . 0.03 25 40 40
18 . 0.02 24 40 40
19 . 0.01 22 39 40
20 . 0.01 20 35 40
21 . 0.01 19 33 40
22 . 0.01 18 31 40
23 . 0.01 16 28 40
24 . 0.01 15 26 40
25 . 0.01 13 24 40
26 . 0.01 12 22 38
27 . 0.01 12 21 36
28 . 0.01 11 19 33
29 . 0.01 10 18 32
30 . 0.01 10 17 29
31 . 0.01 9 16 28
32 . 0.01 9 15 26
33 . 0.01 8 14 24
34 . 0.02 7 13 23
35 . 0.04 7 12 22
36 100.00 0.06 6 12 22
----- ------ ------ -- -- --
Q1 Q2 Q3
-- -- --
all.Phred(B) 18 40 40
avg.Phred(R) 26 30 34
no.Errors(E) 0 1 2
-----------------------------
In the first part of the output table, for each value n (varying from 0 to the largest observed HTS read length), the corresponding row indicates the percentage of HTS reads of length n (column Lfreq), the percentage of HTS reads with n sequencing error(s) (column Efreq), and the 1st, 2nd and 3rd quartiles of observed Phred scores at position n (columns Q1, Q2 and Q3, respectively). The bottom part of the table summarizes the distribution of the Phred scores (first row all.Phred(B): three quartiles Q1, Q2 and Q3), the distribution of the average Phred score per HTS read (middle row avg.Phred(R): three quartiles Q1, Q2 and Q3), and the distribution of the (expected) number of sequencing error(s) per HTS read (last row no.Errors(E)).
The above example therefore shows that the majority of Phred scores fall below Q = 20 at positions 28-36 (i.e. the median Phred score Q2 is lower than 20 from HTS read position 28 onwards). At least 25% of all sequenced bases are associated with Phred scores < 19 (i.e. first quartile Q1 = 18 in row B), but at least 50% of the HTS reads have an average Phred score > 29 (median Q2 = 30 in row R). However, at least 2 sequencing errors are expected within 25% of the HTS reads (third quartile Q3 = 2 in last row E).
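The link between Phred scores and expected errors used in this interpretation can be made explicit. A Phred score Q corresponds to an error probability of 10^(-Q/10) (Ewing and Green 1998), and the expected number of errors in a read is the sum of these probabilities over its bases (Edgar and Flyvbjerg 2015). The sketch below computes that raw sum for a single quality string; the function name expected_errors is hypothetical, and how fastq_info converts this value into the integer n of column Efreq is not stated here, so no rounding is applied.

```shell
#!/bin/sh
# Sketch: expected number of sequencing errors of one read, computed from
# its quality string as the sum of 10^(-Q/10) over all positions.
# Usage: expected_errors <quality string> [Phred offset, default 33]
expected_errors() {
  printf '%s\n' "$1" | awk -v off="${2:-33}" '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    { e = 0
      for (i = 1; i <= length($0); i++) {
        q = ord[substr($0, i, 1)] - off     # Phred score at position i
        e += 10 ^ (-q / 10)                 # error probability at position i
      }
      printf "%.4f\n", e }'
}

expected_errors 'IIII'   # four bases at Q=40, prints: 0.0004
expected_errors '5+'     # one base at Q=20 and one at Q=10, prints: 0.1100
```

The optional second argument plays the same role as the Phred quality offset set with option -p.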
For more details (i.e. one supplementary column for each observed Phred score Q), a full table can be output using option -v f:
fastq_info.sh -v f SRR001666_2.fastq.gz
##File: SRR001666_2.fastq.gz
#no.reads(NR): 7047668
#no.bases(NB): 253716048
#avg.lgt(AL): 36.0
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
n Lfreq Efreq Q1 Q2 Q3 Q= 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
----- ------ ------ -- -- -- ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
0 . 47.17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1 . 26.28 40 40 40 0.1 0.0 0.0 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 94.3
2 . 12.42 40 40 40 0.1 0.0 0.0 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 94.2
3 . 5.97 40 40 40 0.2 0.1 0.0 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 92.3
4 . 2.98 40 40 40 0.2 0.1 0.0 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 92.3
5 . 1.64 40 40 40 0.2 0.1 0.0 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.3 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 91.5
6 . 1.00 40 40 40 0.2 0.1 0.0 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.3 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.3 0.4 0.4 0.4 0.4 0.4 90.6
7 . 0.63 40 40 40 0.2 0.1 0.0 0.1 0.2 0.2 0.1 0.1 0.1 0.2 0.3 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 87.4
8 . 0.45 40 40 40 0.2 0.1 0.0 0.1 0.2 0.2 0.1 0.1 0.1 0.2 0.4 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 86.8
9 . 0.32 40 40 40 0.2 0.1 0.0 0.1 0.2 0.3 0.2 0.2 0.2 0.2 0.5 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.7 0.7 0.7 82.7
10 . 0.25 40 40 40 0.2 0.1 0.1 0.1 0.3 0.3 0.2 0.2 0.2 0.2 0.6 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.8 0.8 80.6
11 . 0.17 40 40 40 0.3 0.1 0.1 0.2 0.3 0.3 0.2 0.2 0.2 0.3 0.6 0.3 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 78.6
12 . 0.13 39 40 40 0.2 0.1 0.1 0.2 0.3 0.4 0.2 0.3 0.3 0.3 0.7 0.4 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.8 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 1.0 0.9 1.0 1.0 74.4
13 . 0.11 35 40 40 0.2 0.2 0.1 0.2 0.3 0.4 0.2 0.3 0.3 0.4 0.8 0.5 0.5 0.6 0.6 0.7 0.7 0.8 0.8 0.8 0.8 0.9 0.9 0.9 1.0 1.0 1.0 1.0 1.0 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 70.1
14 . 0.09 33 40 40 0.2 0.2 0.1 0.2 0.4 0.5 0.3 0.3 0.3 0.4 0.9 0.5 0.6 0.6 0.7 0.7 0.8 0.8 0.8 0.9 0.9 1.0 1.0 1.0 1.0 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.2 1.1 1.2 1.1 1.2 1.1 1.1 1.1 67.8
15 . 0.06 31 40 40 0.2 0.2 0.1 0.2 0.4 0.5 0.3 0.3 0.4 0.5 1.1 0.6 0.7 0.7 0.8 0.8 0.9 0.9 1.0 1.0 1.0 1.1 1.1 1.1 1.1 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 64.9
16 . 0.05 28 40 40 0.2 0.3 0.1 0.3 0.5 0.6 0.4 0.5 0.5 0.6 1.3 0.8 0.8 0.9 0.9 1.0 1.1 1.1 1.1 1.2 1.2 1.2 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.4 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.2 1.2 59.9
17 . 0.03 25 40 40 0.2 0.3 0.2 0.3 0.6 0.8 0.5 0.6 0.6 0.7 1.6 0.9 1.0 1.1 1.1 1.2 1.2 1.3 1.3 1.4 1.4 1.4 1.4 1.5 1.5 1.5 1.4 1.5 1.5 1.5 1.4 1.5 1.4 1.4 1.4 1.4 1.4 1.3 1.3 1.3 54.9
18 . 0.02 24 40 40 0.4 0.3 0.2 0.4 0.6 0.8 0.5 0.6 0.7 0.7 1.7 1.0 1.1 1.1 1.2 1.3 1.3 1.4 1.4 1.5 1.5 1.5 1.6 1.5 1.5 1.6 1.6 1.6 1.5 1.6 1.6 1.5 1.5 1.5 1.4 1.4 1.4 1.4 1.3 1.3 52.2
19 . 0.01 22 39 40 0.2 0.3 0.2 0.4 0.7 0.9 0.6 0.7 0.7 0.9 2.0 1.1 1.2 1.3 1.4 1.4 1.5 1.5 1.6 1.6 1.6 1.7 1.6 1.7 1.6 1.7 1.7 1.6 1.6 1.6 1.6 1.6 1.5 1.5 1.5 1.4 1.4 1.4 1.3 1.3 48.8
20 . 0.01 20 35 40 0.2 0.4 0.2 0.5 0.9 1.1 0.7 0.8 0.9 1.0 2.4 1.3 1.4 1.5 1.6 1.6 1.7 1.7 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.7 1.7 1.6 1.6 1.6 1.5 1.5 1.4 1.4 1.4 1.3 1.3 43.9
21 . 0.01 19 33 40 0.3 0.4 0.3 0.6 1.0 1.3 0.8 0.9 1.0 1.1 2.6 1.5 1.6 1.6 1.8 1.8 1.8 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.8 1.8 1.8 1.8 1.7 1.7 1.6 1.6 1.5 1.5 1.4 1.4 1.4 1.3 1.3 40.8
22 . 0.01 18 31 40 0.2 0.5 0.3 0.7 1.1 1.4 0.9 1.0 1.2 1.3 3.0 1.7 1.7 1.8 1.9 2.0 2.0 2.0 2.1 2.0 2.1 2.1 2.0 2.0 2.0 1.9 1.9 1.9 1.8 1.8 1.7 1.6 1.6 1.5 1.5 1.4 1.4 1.3 1.3 1.2 37.3
23 . 0.01 16 28 40 0.3 0.6 0.4 0.8 1.2 1.7 1.0 1.2 1.3 1.5 3.4 1.9 2.0 2.1 2.1 2.2 2.2 2.3 2.3 2.2 2.2 2.2 2.1 2.1 2.1 2.0 1.9 1.8 1.8 1.7 1.7 1.6 1.6 1.5 1.4 1.4 1.3 1.3 1.2 1.2 33.3
24 . 0.01 15 26 40 0.3 0.7 0.4 0.9 1.4 1.9 1.2 1.4 1.6 1.7 3.9 2.1 2.2 2.3 2.4 2.4 2.4 2.4 2.4 2.4 2.3 2.3 2.2 2.2 2.1 2.1 1.9 1.9 1.8 1.7 1.6 1.6 1.5 1.5 1.4 1.3 1.2 1.2 1.2 1.1 29.5
25 . 0.01 13 24 40 0.2 0.8 0.5 1.0 1.7 2.3 1.4 1.6 1.8 2.0 4.4 2.4 2.5 2.6 2.6 2.6 2.6 2.6 2.5 2.5 2.4 2.3 2.2 2.2 2.1 2.0 1.9 1.8 1.7 1.7 1.6 1.5 1.4 1.4 1.3 1.2 1.2 1.1 1.1 1.0 26.0
26 . 0.01 12 22 38 0.2 1.0 0.6 1.2 2.0 2.6 1.6 1.8 2.0 2.2 4.9 2.6 2.7 2.7 2.7 2.7 2.7 2.7 2.6 2.5 2.4 2.4 2.2 2.2 2.1 2.0 1.9 1.8 1.7 1.6 1.5 1.5 1.4 1.3 1.2 1.2 1.1 1.1 1.0 1.0 23.4
27 . 0.01 12 21 36 0.2 1.2 0.7 1.4 2.2 2.9 1.8 2.0 2.2 2.4 5.2 2.8 2.8 2.9 2.8 2.8 2.8 2.7 2.7 2.6 2.4 2.3 2.3 2.1 2.0 2.0 1.8 1.7 1.6 1.6 1.5 1.4 1.3 1.2 1.2 1.1 1.1 1.0 1.0 0.9 21.3
28 . 0.01 11 19 33 0.2 1.3 0.8 1.6 2.5 3.3 2.0 2.2 2.4 2.6 5.7 3.0 3.0 3.0 3.0 2.9 2.8 2.7 2.7 2.5 2.4 2.3 2.2 2.1 2.0 1.9 1.8 1.7 1.6 1.5 1.4 1.3 1.2 1.2 1.1 1.0 1.0 0.9 0.9 0.8 19.1
29 . 0.01 10 18 32 0.4 1.6 0.9 1.8 2.8 3.6 2.2 2.4 2.6 2.8 6.0 3.1 3.1 3.1 3.1 2.9 2.9 2.8 2.7 2.6 2.4 2.3 2.2 2.1 1.9 1.8 1.7 1.6 1.5 1.4 1.3 1.3 1.2 1.1 1.0 1.0 0.9 0.9 0.8 0.8 17.3
30 . 0.01 10 17 29 0.3 1.8 1.1 2.1 3.2 4.1 2.5 2.7 2.8 3.0 6.4 3.3 3.3 3.2 3.1 3.0 2.9 2.8 2.7 2.5 2.4 2.2 2.1 2.0 1.8 1.8 1.6 1.5 1.4 1.4 1.2 1.2 1.1 1.0 1.0 0.9 0.8 0.8 0.7 0.7 15.4
31 . 0.01 9 16 28 0.2 2.1 1.3 2.4 3.5 4.6 2.6 2.9 3.1 3.2 6.8 3.4 3.4 3.3 3.2 3.1 2.9 2.8 2.6 2.5 2.3 2.2 2.0 1.9 1.8 1.7 1.5 1.4 1.4 1.3 1.2 1.1 1.0 1.0 0.9 0.8 0.8 0.7 0.7 0.6 13.8
32 . 0.01 9 15 26 0.3 2.5 1.4 2.6 3.9 4.9 2.9 3.1 3.4 3.4 7.1 3.5 3.4 3.3 3.2 3.1 2.9 2.8 2.6 2.4 2.2 2.1 2.0 1.8 1.7 1.6 1.4 1.4 1.3 1.2 1.1 1.0 1.0 0.9 0.8 0.8 0.7 0.7 0.6 0.6 12.7
33 . 0.01 8 14 24 0.2 2.9 1.7 3.0 4.3 5.5 3.1 3.4 3.6 3.7 7.3 3.6 3.5 3.4 3.2 3.0 2.8 2.7 2.5 2.3 2.2 2.0 1.9 1.7 1.6 1.5 1.4 1.3 1.2 1.1 1.0 0.9 0.9 0.8 0.7 0.7 0.7 0.6 0.6 0.5 11.1
34 . 0.02 7 13 23 0.2 3.3 1.9 3.3 4.8 5.9 3.4 3.6 3.8 3.9 7.7 3.7 3.6 3.4 3.2 3.0 2.8 2.6 2.4 2.2 2.1 1.9 1.7 1.6 1.5 1.4 1.2 1.2 1.1 1.0 0.9 0.9 0.8 0.7 0.7 0.7 0.6 0.6 0.5 0.5 9.7
35 . 0.04 7 12 22 0.3 3.9 2.1 3.7 5.2 6.4 3.5 3.8 3.9 4.0 7.8 3.7 3.5 3.4 3.2 3.0 2.7 2.5 2.3 2.1 2.0 1.8 1.7 1.5 1.4 1.3 1.2 1.1 1.0 1.0 0.9 0.8 0.7 0.7 0.6 0.6 0.6 0.5 0.5 0.5 8.8
36 100.00 0.06 6 12 22 0.2 4.1 2.4 4.0 5.5 6.7 3.6 3.8 3.9 3.9 7.6 3.6 3.5 3.3 3.0 2.8 2.6 2.4 2.2 2.1 1.9 1.8 1.6 1.5 1.4 1.3 1.2 1.1 1.0 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.5 0.4 9.0
----- ------ ------ -- -- -- ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
Q1 Q2 Q3 Q= 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
-- -- -- ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
all.Phred(B) 18 40 40 0.2 0.9 0.5 1.0 1.5 1.9 1.1 1.2 1.3 1.4 2.9 1.5 1.6 1.6 1.6 1.5 1.5 1.5 1.5 1.4 1.4 1.4 1.3 1.3 1.2 1.2 1.2 1.1 1.1 1.1 1.0 1.0 1.0 1.0 0.9 0.9 0.9 0.9 0.8 0.8 51.0
avg.Phred(R) 26 30 34 0.1 0.0 0.0 0.0 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.4 0.4 0.5 0.6 0.8 1.0 1.3 1.6 2.2 2.7 3.5 4.2 5.1 5.8 6.5 7.0 7.3 7.3 7.1 6.7 6.2 5.5 4.8 3.9 2.9 1.9 0.7
no.Errors(E) 0 1 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#subsampling.rate: 0.111
To help with reading, the main Phred-derived statistics for all files can be summarized in tab-delimited format using option -v t:
fastq_info.sh -v t SRR001666*.fastq.gz
#File NR NB AL BQ1 BQ2 BQ3 RQ1 RQ2 RQ3 EQ1 EQ2 EQ3
SRR001666_1.fastq.gz 7047668 253716048 36.0 30 40 40 32 35 37 0 0 0
SRR001666_2.fastq.gz 7047668 253716048 36.0 18 40 40 26 30 34 0 1 2
This simple output format makes it easy to read every file name (#File), no. HTS reads (NR) and bases (NB), average HTS read length (AL), as well as the three quartiles of the three Phred-related distributions, i.e. Phred score per base (BQ1, BQ2, BQ3), average Phred score per HTS read (RQ1, RQ2, RQ3), and expected number of sequencing error(s) per HTS read (EQ1, EQ2, EQ3).
The above example clearly shows that the overall sequencing error rate is lower in file SRR001666_1.fastq.gz than in file SRR001666_2.fastq.gz, therefore leading to many more HTS reads without sequencing error in the former FASTQ file.
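Because this output is tab-delimited with a header line starting with #, it can be filtered with standard tools. As a sketch, the following function lists the files whose third quartile of expected errors per read (column EQ3) reaches a given threshold; the function name flag_high_error_files and the threshold are illustrative assumptions, and the two result rows shown above are rebuilt as literal text so that the example runs without the FASTQ files.

```shell
#!/bin/sh
# Sketch: from tab-delimited results (as produced with option -v t) read
# on standard input, print the names of files whose EQ3 value is at least
# the given threshold. The EQ3 column is located from the header line.
flag_high_error_files() {
  awk -F '\t' -v min="$1" '
    /^#/ { for (i = 1; i <= NF; i++) if ($i == "EQ3") col = i; next }
    col && $col >= min { print $1 }'
}

# The two result rows shown above, rebuilt here as literal text
{
  printf '#File\tNR\tNB\tAL\tBQ1\tBQ2\tBQ3\tRQ1\tRQ2\tRQ3\tEQ1\tEQ2\tEQ3\n'
  printf 'SRR001666_1.fastq.gz\t7047668\t253716048\t36.0\t30\t40\t40\t32\t35\t37\t0\t0\t0\n'
  printf 'SRR001666_2.fastq.gz\t7047668\t253716048\t36.0\t18\t40\t40\t26\t30\t34\t0\t1\t2\n'
} | flag_high_error_files 2     # prints: SRR001666_2.fastq.gz
```

In practice the function would be fed directly from a command line such as fastq_info.sh -v t dirname/*.fastq.gz 2>/dev/null piped into flag_high_error_files 2.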
By default, all distributions are estimated from a subset of all input FASTQ blocks to obtain fast running times (the subsampling rate is indicated when using option -v f). In almost all cases, the default subsampling rate (i.e. ~10% with option -s 5) is sufficient to efficiently approximate the different distributions.
For example, the command line below uses all FASTQ blocks from each input file (i.e. option -s 1):
fastq_info.sh -s 1 -v t SRR001666*.fastq.gz
#File NR NB AL BQ1 BQ2 BQ3 RQ1 RQ2 RQ3 EQ1 EQ2 EQ3
SRR001666_1.fastq.gz 7047668 253716048 36.0 30 40 40 32 35 37 0 0 0
SRR001666_2.fastq.gz 7047668 253716048 36.0 18 40 40 26 30 34 0 1 2
All statistics are identical to the ones previously estimated (see above), but the overall running time was 8 times longer. For comparison, when used with default options, fastq_info is expected to run twice as fast as FastQC to process one FASTQ file.
Bonfield JK, Mahoney MV (2013) Compression of FASTQ and SAM format sequencing data. PLOS One, 8(3):e59190. doi:10.1371/journal.pone.0059190.
Edgar RC, Flyvbjerg H (2015) Error filtering, pair assembly and error correction for next-generation sequencing reads. Bioinformatics, 31(21):3476-3482. doi:10.1093/bioinformatics/btv401.
Ewing B, Green P (1998) Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome Research, 8:186-194. doi:10.1101/gr.8.3.186.
Jones DC, Ruzzo WL, Peng X, Katze MG (2012) Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Research, 40(22):e171–e171. doi:10.1093/nar/gks754.
Roguski L, Deorowicz S (2014) DSRC 2 - Industry-oriented compression of FASTQ files. Bioinformatics, 30(15):2213-2215. doi:10.1093/bioinformatics/btu208.