fastq_info

fastq_info is a command-line program written in Bash/AWK for quickly estimating several standard descriptive statistics from FASTQ-formatted High-Throughput Sequencing (HTS) files. The statistics estimated for each FASTQ file are:

  ▹   HTS read and base numbers,

  ▹   HTS read length distribution,

  ▹   GC-content per HTS read position,

  ▹   Phred score distribution (global and for each HTS read position),

  ▹   distribution of the average Phred score per HTS read.

Several output result formats are available (e.g. reduced/full table, tab-delimited).

Installation and execution

Clone this repository with the following command line:

git clone https://gitlab.pasteur.fr/GIPhy/fastq_info.git

Go to the directory fastq_info/ and give execute permission to the file fastq_info.sh:

chmod +x fastq_info.sh

and run it using the following command line template:

./fastq_info.sh [options]

About AWK

fastq_info requires an AWK interpreter in the $PATH, which is the case for most Linux distributions. By default, fastq_info first looks for gawk (GNU awk, generally available on recent Linux distributions); if it is not found, the basic command awk in the $PATH is used. Alternative AWK implementations can be specified using option -a (see Usage section). In practice, fastq_info is able to detect and deal with most AWK interpreters (e.g. nawk, mawk, goawk).
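
As a rough illustration of the default lookup order described above (gawk first, then awk), the following sketch picks an interpreter from the $PATH. The function name pick_awk and the variable AWK_BIN are hypothetical; the detection logic actually implemented in fastq_info.sh may differ.

```shell
# Hypothetical helper (not taken from fastq_info.sh): print the first AWK
# interpreter found in $PATH, trying gawk before the basic awk command.
pick_awk() {
  for candidate in gawk awk; do
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  echo "no AWK interpreter found in \$PATH" >&2
  return 1
}

# Store the selected interpreter and run a trivial program with it.
AWK_BIN=$(pick_awk) && "$AWK_BIN" 'BEGIN { print "AWK interpreter in use" }'
```

Running this before fastq_info.sh is a quick way to see which interpreter would likely be found first on a given machine.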

About compressed FASTQ files

To use fastq_info with standard FASTQ compression formats, it is expected that the following binaries are available in the $PATH:

  ▹   gzip, required to deal with files compressed using gzip;

  ▹   bzip2, required to deal with files compressed using bzip2;

  ▹   pigz, optional, used to decompress gzip-compressed files on multiple threads (when not installed, gzip is used instead);

  ▹   pbzip2, optional, used to decompress bzip2-compressed files on multiple threads (when not installed, bzip2 is used instead);

  ▹   dsrc, required to deal with files compressed using DSRC 2.0 RC/RC2 (Roguski and Deorowicz 2014);

  ▹   fqzcomp, required to deal with files compressed using fqzcomp 4.0 (Bonfield and Mahoney 2013);

  ▹   quip, required to deal with files compressed using QUIP (Jones et al. 2012).

To run fastq_info, it is not required to install all these binaries, but the dedicated tool(s) should be available depending on the compression format of the input files.
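
To see which of the binaries listed above are present on a given machine, a small loop over their names is enough. The sketch below is independent of fastq_info (whose option -c performs its own check); the function name check_tools is hypothetical.

```shell
# Hypothetical helper: report, for each decompression tool named above,
# whether it can be found in $PATH.
check_tools() {
  for tool in gzip bzip2 pigz pbzip2 dsrc fqzcomp quip; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%-8s found\n' "$tool"
    else
      printf '%-8s missing\n' "$tool"
    fi
  done
}

check_tools
```

A tool reported as missing only matters when input files use the corresponding compression format.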

Usage

Run fastq_info without any option to read the following documentation:

 USAGE:  fastq_info.sh  [options]  [<file1> <file2> ...] 

 Allowed file extensions (case insensitive):
  .bz
  .bz2
  .bzip
  .bzip2 ... considered as FASTQ-formatted files compressed using bzip2;
             decompressed  using  bunzip2  or pbzip2  (when available in 
             $PATH)
  .dsrc
  .dsrc2 ... considered as  FASTQ-formatted  files compressed using DSRC 
             v2.0 (sun.aei.polsl.pl/dsrc);  decompressed using DSRC v2.0
             (when available in $PATH)
  .fastq
  .fq
  .txt ..... considered as uncompressed FASTQ-formatted files

  .fqz ..... considered  as   FASTQ-formatted  files   compressed  using 
             fqzcomp  v4  (github.com/jkbonfield/fqzcomp);  decompressed 
             using fqzcomp v4 (when available in $PATH)
  .gz
  .gzip .... considered as FASTQ-formatted files  compressed using gzip;
             decompressed using gunzip or pigz (when available in $PATH)

  .qp ...... considered as  FASTQ-formatted files  compressed using QUIP
             (github.com/dcjones/quip);  decompressed  using QUIP  (when 
             available in $PATH)

 Options:
  -s <int>   speed index between 1 (slower) and 9 (faster) to manage the 
             subsampling rate;  when set to 1, 2, 3, 4 or 5, results are
             based on 100%, ~33%, ~20%, ~15% or ~10% of all FASTQ blocks
             (default: 5)
  -v <char>  reduced (r),  full (f) or  tab-delimited (t)  result output
             (default: r)
  -p <int>   Phred quality offset (default: 33)
  -d         DOS end-of-lines in input file(s) (default: not set)
  -t <int>   number of thread(s) for decompressing files (default: 1)
  -a         AWK interpreter (default: gawk or awk in $PATH)
  -c         checks available tools (default: not set)
  -h         prints this help and exits
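
To illustrate what the Phred quality offset of option -p means, the sketch below decodes a FASTQ quality string into Phred scores by subtracting the offset from each character code. The function name phred_scores is hypothetical and the code is not taken from fastq_info.sh; it only assumes the usual ASCII encoding of FASTQ qualities.

```shell
# Hypothetical helper: decode a quality string ($1) into Phred scores,
# using the quality offset given as $2 (33 when omitted).
phred_scores() {
  printf '%s\n' "$1" | awk -v offset="${2:-33}" '
    # lookup table from printable ASCII character to character code
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    {
      out = ""
      for (i = 1; i <= length($0); i++) {
        q = ord[substr($0, i, 1)] - offset
        out = (i == 1) ? q : out " " q
      }
      print out
    }'
}

phred_scores 'II5+!'      # offset 33: prints 40 40 20 10 0
phred_scores 'hhT@' 64    # offset 64: prints 40 40 20 0
```

The same character therefore maps to different scores depending on the offset, which is why the offset must match the encoding of the input files.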

Notes

Examples

The following Bash command line downloads the pair of gzipped FASTQ files SRR001666_1.fastq.gz and SRR001666_2.fastq.gz, which are used as examples below:

wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666*.fastq.gz

Basic usage

The following command line runs fastq_info.sh to analyze the second (i.e. R2) downloaded file:

./fastq_info.sh  SRR001666_2.fastq.gz

leading to the following standard output:

##File: SRR001666_2.fastq.gz
#no.reads: 7047668
#no.bases: 253716048
#avg.lgt:  36.0
------------------------------
pos    Lfreq GCfreq   Q1 Q2 Q3
----- ------ ------   -- -- --
1          .  47.61   40 40 40
2          .  47.18   40 40 40
3          .  49.13   40 40 40
4          .  50.43   40 40 40
5          .  49.95   40 40 40
6          .  51.04   40 40 40
7          .  50.40   40 40 40
8          .  50.22   40 40 40
9          .  50.63   40 40 40
10         .  50.74   40 40 40
11         .  50.72   40 40 40
12         .  51.04   39 40 40
13         .  50.77   35 40 40
14         .  50.69   33 40 40
15         .  51.01   31 40 40
16         .  51.08   28 40 40
17         .  51.07   25 40 40
18         .  51.44   24 40 40
19         .  51.40   22 39 40
20         .  51.25   20 35 40
21         .  51.52   19 33 40
22         .  51.46   18 31 40
23         .  51.24   16 28 40
24         .  51.53   15 26 40
25         .  51.55   13 24 40
26         .  51.46   12 22 38
27         .  51.75   12 21 36
28         .  51.72   11 19 33
29         .  51.45   10 18 32
30         .  51.85   10 17 29
31         .  51.43    9 16 28
32         .  51.36    9 15 26
33         .  51.74    8 14 24
34         .  51.73    7 13 23
35         .  51.63    7 12 22
36    100.00  51.70    6 12 22
----- ------ ------   -- -- --
                      Q1 Q2 Q3
                      -- -- --
bases                 18 40 40
reads                 26 30 34
------------------------------

The first part of the output table contains one row per HTS read position (column pos). For each pos value (ranging from 1 to the largest observed HTS read length), the corresponding row gives the percentage of HTS reads whose length equals pos (column Lfreq), the percentage of G or C bases observed at that position (column GCfreq), and the 1st, 2nd and 3rd quartiles of the observed Phred scores (columns Q1, Q2 and Q3, respectively). The bottom part of the table summarizes the global Phred score distribution (row bases: three quartiles Q1, Q2 and Q3) and the distribution of the average Phred score per HTS read (last row reads: three quartiles Q1, Q2 and Q3).

The above example therefore shows that the majority of Phred scores fall below Q = 20 at positions 28-36 (i.e. the median Phred score Q2 is lower than 20 from HTS read position 28 onwards). At least 25% of all sequenced bases are associated with Phred scores < 19 (i.e. first quartile Q1 = 18 in row bases), but at least 50% of the HTS reads have an average Phred score > 29 (median Q2 = 30 in row reads).
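
The per-position columns of the table (Lfreq, GCfreq, Q1, Q2, Q3) can be reproduced on a toy input with a few lines of AWK. The sketch below is only an illustration under stated assumptions: the names sample_fastq and position_stats are hypothetical, the offset is fixed to 33, every FASTQ block is used (no subsampling), zero percentages are printed as 0.00, and each quartile is taken as the smallest score whose cumulative count reaches 25%, 50% or 75% of the bases, which may not be the exact convention used by fastq_info.

```shell
# Toy input: four reads of length 4, 4, 4 and 3 (hypothetical data).
sample_fastq() {
  cat <<'EOF'
@r1
ACGT
+
IIII
@r2
GCGT
+
II5+
@r3
ATGA
+
I5++
@r4
GGC
+
55!
EOF
}

# Hypothetical sketch: print "pos Lfreq GCfreq Q1 Q2 Q3" for each position.
position_stats() {
  awk -v offset=33 '
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 2 {                       # sequence line
      len = length($0); nreads++; lcount[len]++
      if (len > maxlen) maxlen = len
      for (p = 1; p <= len; p++) {
        n[p]++
        if (index("GCgc", substr($0, p, 1))) gc[p]++
      }
    }
    NR % 4 == 0 {                       # quality line
      for (p = 1; p <= length($0); p++)
        hist[p, ord[substr($0, p, 1)] - offset]++
    }
    # smallest score whose cumulative count reaches the fraction f of n[p]
    function quantile(p, f,    q, cum) {
      for (q = 0; q <= 93; q++) {
        cum += hist[p, q]
        if (cum >= f * n[p]) return q
      }
      return 93
    }
    END {
      for (p = 1; p <= maxlen; p++)
        printf "%d %.2f %.2f %d %d %d\n", p, 100 * lcount[p] / nreads,
               100 * gc[p] / n[p], quantile(p, 0.25), quantile(p, 0.5), quantile(p, 0.75)
    }'
}

sample_fastq | position_stats
```

On this toy input, position 3 gives Lfreq = 25.00 (one read out of four has length 3) and GCfreq = 100.00 (all four reads carry G or C at that position).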

Advanced usage

For more details (i.e. one supplementary column for each observed Phred score), a full table can be output using option -v f:

./fastq_info.sh  -v f  SRR001666_2.fastq.gz
##File: SRR001666_2.fastq.gz
#no.reads: 7047668
#no.bases: 253716048
#avg.lgt:  36.0
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pos    Lfreq GCfreq   Q1 Q2 Q3 Q=   0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25    26    27    28    29    30    31    32    33    34    35    36    37    38    39    40
----- ------ ------   -- -- -- ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1          .  47.61   40 40 40    0.1   0.0   0.0   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.2   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2  94.3
2          .  47.18   40 40 40    0.1   0.0   0.0   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.2   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2  94.2
3          .  49.13   40 40 40    0.2   0.1   0.0   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.2   0.1   0.1   0.1   0.1   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3  92.3
4          .  50.43   40 40 40    0.2   0.1   0.0   0.1   0.1   0.2   0.1   0.1   0.1   0.1   0.2   0.1   0.1   0.1   0.1   0.1   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3  92.3
5          .  49.95   40 40 40    0.2   0.1   0.0   0.1   0.1   0.2   0.1   0.1   0.1   0.1   0.3   0.1   0.1   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3  91.5
6          .  51.04   40 40 40    0.2   0.1   0.0   0.1   0.1   0.2   0.1   0.1   0.1   0.1   0.3   0.1   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.4   0.3   0.4   0.4   0.4   0.4   0.4  90.6
7          .  50.40   40 40 40    0.2   0.1   0.0   0.1   0.2   0.2   0.1   0.1   0.1   0.2   0.3   0.2   0.2   0.2   0.2   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.4   0.4   0.4   0.4   0.4   0.4   0.4   0.4   0.4   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5  87.4
8          .  50.22   40 40 40    0.2   0.1   0.0   0.1   0.2   0.2   0.1   0.1   0.1   0.2   0.4   0.2   0.2   0.2   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.4   0.4   0.4   0.4   0.4   0.4   0.4   0.4   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5  86.8
9          .  50.63   40 40 40    0.2   0.1   0.0   0.1   0.2   0.3   0.2   0.2   0.2   0.2   0.5   0.3   0.3   0.3   0.3   0.4   0.4   0.4   0.4   0.4   0.5   0.5   0.5   0.5   0.5   0.5   0.6   0.6   0.6   0.6   0.6   0.6   0.6   0.7   0.7   0.7   0.7   0.7   0.7   0.7  82.7
10         .  50.74   40 40 40    0.2   0.1   0.1   0.1   0.3   0.3   0.2   0.2   0.2   0.2   0.6   0.3   0.3   0.4   0.4   0.4   0.4   0.5   0.5   0.5   0.5   0.5   0.5   0.6   0.6   0.6   0.6   0.6   0.6   0.7   0.7   0.7   0.7   0.7   0.7   0.7   0.7   0.7   0.8   0.8  80.6
11         .  50.72   40 40 40    0.3   0.1   0.1   0.2   0.3   0.3   0.2   0.2   0.2   0.3   0.6   0.3   0.4   0.4   0.4   0.5   0.5   0.5   0.5   0.6   0.6   0.6   0.6   0.6   0.7   0.7   0.7   0.7   0.7   0.7   0.7   0.8   0.8   0.8   0.8   0.8   0.8   0.8   0.8   0.8  78.6
12         .  51.04   39 40 40    0.2   0.1   0.1   0.2   0.3   0.4   0.2   0.3   0.3   0.3   0.7   0.4   0.5   0.5   0.5   0.6   0.6   0.6   0.7   0.7   0.7   0.7   0.8   0.8   0.8   0.8   0.8   0.9   0.9   0.9   0.9   0.9   0.9   0.9   0.9   0.9   1.0   0.9   1.0   1.0  74.4
13         .  50.77   35 40 40    0.2   0.2   0.1   0.2   0.3   0.4   0.2   0.3   0.3   0.4   0.8   0.5   0.5   0.6   0.6   0.7   0.7   0.8   0.8   0.8   0.8   0.9   0.9   0.9   1.0   1.0   1.0   1.0   1.0   1.1   1.1   1.1   1.1   1.1   1.1   1.1   1.1   1.1   1.1   1.1  70.1
14         .  50.69   33 40 40    0.2   0.2   0.1   0.2   0.4   0.5   0.3   0.3   0.3   0.4   0.9   0.5   0.6   0.6   0.7   0.7   0.8   0.8   0.8   0.9   0.9   1.0   1.0   1.0   1.0   1.1   1.1   1.1   1.1   1.1   1.1   1.1   1.2   1.1   1.2   1.1   1.2   1.1   1.1   1.1  67.8
15         .  51.01   31 40 40    0.2   0.2   0.1   0.2   0.4   0.5   0.3   0.3   0.4   0.5   1.1   0.6   0.7   0.7   0.8   0.8   0.9   0.9   1.0   1.0   1.0   1.1   1.1   1.1   1.1   1.2   1.2   1.2   1.2   1.2   1.2   1.2   1.2   1.2   1.2   1.2   1.2   1.2   1.2   1.2  64.9
16         .  51.08   28 40 40    0.2   0.3   0.1   0.3   0.5   0.6   0.4   0.5   0.5   0.6   1.3   0.8   0.8   0.9   0.9   1.0   1.1   1.1   1.1   1.2   1.2   1.2   1.3   1.3   1.3   1.3   1.3   1.3   1.3   1.4   1.3   1.3   1.3   1.3   1.3   1.3   1.3   1.3   1.2   1.2  59.9
17         .  51.07   25 40 40    0.2   0.3   0.2   0.3   0.6   0.8   0.5   0.6   0.6   0.7   1.6   0.9   1.0   1.1   1.1   1.2   1.2   1.3   1.3   1.4   1.4   1.4   1.4   1.5   1.5   1.5   1.4   1.5   1.5   1.5   1.4   1.5   1.4   1.4   1.4   1.4   1.4   1.3   1.3   1.3  54.9
18         .  51.44   24 40 40    0.4   0.3   0.2   0.4   0.6   0.8   0.5   0.6   0.7   0.7   1.7   1.0   1.1   1.1   1.2   1.3   1.3   1.4   1.4   1.5   1.5   1.5   1.6   1.5   1.5   1.6   1.6   1.6   1.5   1.6   1.6   1.5   1.5   1.5   1.4   1.4   1.4   1.4   1.3   1.3  52.2
19         .  51.40   22 39 40    0.2   0.3   0.2   0.4   0.7   0.9   0.6   0.7   0.7   0.9   2.0   1.1   1.2   1.3   1.4   1.4   1.5   1.5   1.6   1.6   1.6   1.7   1.6   1.7   1.6   1.7   1.7   1.6   1.6   1.6   1.6   1.6   1.5   1.5   1.5   1.4   1.4   1.4   1.3   1.3  48.8
20         .  51.25   20 35 40    0.2   0.4   0.2   0.5   0.9   1.1   0.7   0.8   0.9   1.0   2.4   1.3   1.4   1.5   1.6   1.6   1.7   1.7   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.7   1.7   1.6   1.6   1.6   1.5   1.5   1.4   1.4   1.4   1.3   1.3  43.9
21         .  51.52   19 33 40    0.3   0.4   0.3   0.6   1.0   1.3   0.8   0.9   1.0   1.1   2.6   1.5   1.6   1.6   1.8   1.8   1.8   1.9   1.9   1.9   1.9   1.9   1.9   1.9   1.9   1.8   1.8   1.8   1.8   1.7   1.7   1.6   1.6   1.5   1.5   1.4   1.4   1.4   1.3   1.3  40.8
22         .  51.46   18 31 40    0.2   0.5   0.3   0.7   1.1   1.4   0.9   1.0   1.2   1.3   3.0   1.7   1.7   1.8   1.9   2.0   2.0   2.0   2.1   2.0   2.1   2.1   2.0   2.0   2.0   1.9   1.9   1.9   1.8   1.8   1.7   1.6   1.6   1.5   1.5   1.4   1.4   1.3   1.3   1.2  37.3
23         .  51.24   16 28 40    0.3   0.6   0.4   0.8   1.2   1.7   1.0   1.2   1.3   1.5   3.4   1.9   2.0   2.1   2.1   2.2   2.2   2.3   2.3   2.2   2.2   2.2   2.1   2.1   2.1   2.0   1.9   1.8   1.8   1.7   1.7   1.6   1.6   1.5   1.4   1.4   1.3   1.3   1.2   1.2  33.3
24         .  51.53   15 26 40    0.3   0.7   0.4   0.9   1.4   1.9   1.2   1.4   1.6   1.7   3.9   2.1   2.2   2.3   2.4   2.4   2.4   2.4   2.4   2.4   2.3   2.3   2.2   2.2   2.1   2.1   1.9   1.9   1.8   1.7   1.6   1.6   1.5   1.5   1.4   1.3   1.2   1.2   1.2   1.1  29.5
25         .  51.55   13 24 40    0.2   0.8   0.5   1.0   1.7   2.3   1.4   1.6   1.8   2.0   4.4   2.4   2.5   2.6   2.6   2.6   2.6   2.6   2.5   2.5   2.4   2.3   2.2   2.2   2.1   2.0   1.9   1.8   1.7   1.7   1.6   1.5   1.4   1.4   1.3   1.2   1.2   1.1   1.1   1.0  26.0
26         .  51.46   12 22 38    0.2   1.0   0.6   1.2   2.0   2.6   1.6   1.8   2.0   2.2   4.9   2.6   2.7   2.7   2.7   2.7   2.7   2.7   2.6   2.5   2.4   2.4   2.2   2.2   2.1   2.0   1.9   1.8   1.7   1.6   1.5   1.5   1.4   1.3   1.2   1.2   1.1   1.1   1.0   1.0  23.4
27         .  51.75   12 21 36    0.2   1.2   0.7   1.4   2.2   2.9   1.8   2.0   2.2   2.4   5.2   2.8   2.8   2.9   2.8   2.8   2.8   2.7   2.7   2.6   2.4   2.3   2.3   2.1   2.0   2.0   1.8   1.7   1.6   1.6   1.5   1.4   1.3   1.2   1.2   1.1   1.1   1.0   1.0   0.9  21.3
28         .  51.72   11 19 33    0.2   1.3   0.8   1.6   2.5   3.3   2.0   2.2   2.4   2.6   5.7   3.0   3.0   3.0   3.0   2.9   2.8   2.7   2.7   2.5   2.4   2.3   2.2   2.1   2.0   1.9   1.8   1.7   1.6   1.5   1.4   1.3   1.2   1.2   1.1   1.0   1.0   0.9   0.9   0.8  19.1
29         .  51.45   10 18 32    0.4   1.6   0.9   1.8   2.8   3.6   2.2   2.4   2.6   2.8   6.0   3.1   3.1   3.1   3.1   2.9   2.9   2.8   2.7   2.6   2.4   2.3   2.2   2.1   1.9   1.8   1.7   1.6   1.5   1.4   1.3   1.3   1.2   1.1   1.0   1.0   0.9   0.9   0.8   0.8  17.3
30         .  51.85   10 17 29    0.3   1.8   1.1   2.1   3.2   4.1   2.5   2.7   2.8   3.0   6.4   3.3   3.3   3.2   3.1   3.0   2.9   2.8   2.7   2.5   2.4   2.2   2.1   2.0   1.8   1.8   1.6   1.5   1.4   1.4   1.2   1.2   1.1   1.0   1.0   0.9   0.8   0.8   0.7   0.7  15.4
31         .  51.43    9 16 28    0.2   2.1   1.3   2.4   3.5   4.6   2.6   2.9   3.1   3.2   6.8   3.4   3.4   3.3   3.2   3.1   2.9   2.8   2.6   2.5   2.3   2.2   2.0   1.9   1.8   1.7   1.5   1.4   1.4   1.3   1.2   1.1   1.0   1.0   0.9   0.8   0.8   0.7   0.7   0.6  13.8
32         .  51.36    9 15 26    0.3   2.5   1.4   2.6   3.9   4.9   2.9   3.1   3.4   3.4   7.1   3.5   3.4   3.3   3.2   3.1   2.9   2.8   2.6   2.4   2.2   2.1   2.0   1.8   1.7   1.6   1.4   1.4   1.3   1.2   1.1   1.0   1.0   0.9   0.8   0.8   0.7   0.7   0.6   0.6  12.7
33         .  51.74    8 14 24    0.2   2.9   1.7   3.0   4.3   5.5   3.1   3.4   3.6   3.7   7.3   3.6   3.5   3.4   3.2   3.0   2.8   2.7   2.5   2.3   2.2   2.0   1.9   1.7   1.6   1.5   1.4   1.3   1.2   1.1   1.0   0.9   0.9   0.8   0.7   0.7   0.7   0.6   0.6   0.5  11.1
34         .  51.73    7 13 23    0.2   3.3   1.9   3.3   4.8   5.9   3.4   3.6   3.8   3.9   7.7   3.7   3.6   3.4   3.2   3.0   2.8   2.6   2.4   2.2   2.1   1.9   1.7   1.6   1.5   1.4   1.2   1.2   1.1   1.0   0.9   0.9   0.8   0.7   0.7   0.7   0.6   0.6   0.5   0.5   9.7
35         .  51.63    7 12 22    0.3   3.9   2.1   3.7   5.2   6.4   3.5   3.8   3.9   4.0   7.8   3.7   3.5   3.4   3.2   3.0   2.7   2.5   2.3   2.1   2.0   1.8   1.7   1.5   1.4   1.3   1.2   1.1   1.0   1.0   0.9   0.8   0.7   0.7   0.6   0.6   0.6   0.5   0.5   0.5   8.8
36    100.00  51.70    6 12 22    0.2   4.1   2.4   4.0   5.5   6.7   3.6   3.8   3.9   3.9   7.6   3.6   3.5   3.3   3.0   2.8   2.6   2.4   2.2   2.1   1.9   1.8   1.6   1.5   1.4   1.3   1.2   1.1   1.0   0.9   0.8   0.8   0.7   0.7   0.6   0.6   0.5   0.5   0.5   0.4   9.0
----- ------ ------   -- -- -- ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
                      Q1 Q2 Q3 Q=   0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25    26    27    28    29    30    31    32    33    34    35    36    37    38    39    40
                      -- -- -- ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
bases                 18 40 40    0.2   0.9   0.5   1.0   1.5   1.9   1.1   1.2   1.3   1.4   2.9   1.5   1.6   1.6   1.6   1.5   1.5   1.5   1.5   1.4   1.4   1.4   1.3   1.3   1.2   1.2   1.2   1.1   1.1   1.1   1.0   1.0   1.0   1.0   0.9   0.9   0.9   0.9   0.8   0.8  51.0
reads                 26 30 34    0.1   0.0   0.0   0.0   0.1   0.1   0.1   0.1   0.2   0.2   0.2   0.2   0.3   0.3   0.4   0.4   0.5   0.6   0.8   1.0   1.3   1.6   2.2   2.7   3.5   4.2   5.1   5.8   6.5   7.0   7.3   7.3   7.1   6.7   6.2   5.5   4.8   3.9   2.9   1.9   0.7
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#subsampling.rate: 0.111

Tab-delimited outputs

To help with reading, the main statistics for all files can be summarized in tab-delimited format using option -v t:

./fastq_info.sh  -v t  SRR001666*.fastq.gz
#File                NR      NB        AL    BQ1 BQ2 BQ3  RQ1 RQ2 RQ3
SRR001666_1.fastq.gz 7047668 253716048 36.0  30  40  40   32  35  37 
SRR001666_2.fastq.gz 7047668 253716048 36.0  18  40  40   26  30  34 

This simple output format makes it easy to read, for every file, its name (#File), the number of HTS reads (NR) and bases (NB), the average HTS read length (AL), as well as the three quartiles of the global Phred score distribution (BQ1, BQ2, BQ3) and of the distribution of the average Phred score per HTS read (RQ1, RQ2, RQ3).

The above example shows that Phred scores are overall higher, and the estimated sequencing error rate therefore lower, in file SRR001666_1.fastq.gz than in file SRR001666_2.fastq.gz.
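
The link between Phred scores and error rates follows from the standard definition P = 10^(-Q/10), where P is the error probability associated with Phred score Q. As an illustration, the sketch below reads summary rows such as the ones above and converts the BQ1 column into the corresponding error probability. The function name summarize_error is hypothetical, and the rows are split on any whitespace, which assumes file names without blanks.

```shell
# Hypothetical helper: for each non-header row read on stdin, print the file
# name, BQ1 (5th column) and the error probability 10^(-BQ1/10).
summarize_error() {
  awk '!/^#/ && NF >= 10 { printf "%s BQ1=%d P(error)=%.4f\n", $1, $5, 10 ^ (-$5 / 10) }'
}

summarize_error <<'EOF'
#File                NR      NB        AL    BQ1 BQ2 BQ3  RQ1 RQ2 RQ3
SRR001666_1.fastq.gz 7047668 253716048 36.0  30  40  40   32  35  37
SRR001666_2.fastq.gz 7047668 253716048 36.0  18  40  40   26  30  34
EOF
```

This prints P(error)=0.0010 for the first file and P(error)=0.0158 for the second one: a quarter of the bases in SRR001666_2.fastq.gz have an error probability of at least about 1.6%, against 0.1% in SRR001666_1.fastq.gz.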

Note on the subsampling rate

By default, all distributions are estimated from a subset of all input FASTQ blocks to obtain fast running times (the subsampling rate is indicated when using option -v f). In almost all cases, the default subsampling rate (i.e. ~10% with option -s 5) is sufficient to closely approximate the different distributions (i.e. HTS read lengths, GC-content, Phred scores).

For example, the command line below uses all FASTQ blocks from each input file (i.e. option -s 1):

./fastq_info.sh  -s 1  -v t  SRR001666*.fastq.gz
#File                NR      NB        AL    BQ1 BQ2 BQ3  RQ1 RQ2 RQ3
SRR001666_1.fastq.gz 7047668 253716048 36.0  30  40  40   32  35  37 
SRR001666_2.fastq.gz 7047668 253716048 36.0  18  40  40   26  30  34 

All statistics are identical to the ones previously estimated (see above), but the overall running time was 8 times longer. For comparison, when used with default options, fastq_info is expected to run ~1.5 times faster than FastQC to process one FASTQ file.
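
The idea of working on a subset of FASTQ blocks can be sketched with a one-line AWK filter that keeps one block (four lines) out of every step blocks. The subsampling rate of 0.111 reported in the full table above corresponds to about one block in nine, hence the default step of 9 used here. How fastq_info actually selects blocks is not described in this document, so the regular sampling below is an assumption, and the names make_blocks and subsample_blocks are hypothetical.

```shell
# Hypothetical generator: write n dummy FASTQ blocks named @read1 ... @readn.
make_blocks() {
  awk -v n="$1" 'BEGIN { for (i = 1; i <= n; i++) printf "@read%d\nACGT\n+\nIIII\n", i }'
}

# Hypothetical filter: keep the 1st block, then one block out of every `step`
# (block index = int((NR - 1) / 4), starting from 0).
subsample_blocks() {
  awk -v step="${1:-9}" 'int((NR - 1) / 4) % step == 0'
}

# Out of 18 blocks, a step of 9 keeps blocks 1 and 10; print their headers.
make_blocks 18 | subsample_blocks 9 | awk 'NR % 4 == 1'
```

Statistics computed on the retained blocks only approximate those of the whole file, which is consistent with the identical results observed above between the default setting and option -s 1 on these example files.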

References

Bonfield JK, Mahoney MV (2013) Compression of FASTQ and SAM format sequencing data. PLOS One, 8(3):e59190. doi:10.1371/journal.pone.0059190.

Jones DC, Ruzzo WL, Peng X, Katze MG (2012) Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Research, 40(22):e171. doi:10.1093/nar/gks754.

Roguski L, Deorowicz S (2014) DSRC 2 - Industry-oriented compression of FASTQ files. Bioinformatics, 30(15):2213-2215. doi:10.1093/bioinformatics/btu208.