GPLv3 license JAVA

CoPro

CoPro is a command line program written in Java to infer the coverage profile of a genome assembly (Lindner et al 2013). It analyses the sequencing reads aligned against their own assembly in order to assess whether the derived histogram over all observed per-base coverages is expectedly unimodal or not.

CoPro returns different results:
  ▹   average read coverage per contig,
  ▹   coverage detail of every assembled base,
  ▹   monomodality (momo) index of the genome coverage profile,
  ▹   fitted parameters of the associated generalized Poisson (GP) distribution,
  ▹   R2 fitting index,
  ▹   coverage depth confidence interval.

Of note, both momo and R2 indices can be used as quality control metrics to assess the overall accuracy of a genome assembly.

In complement, CoPro enhances the genome assembly from its coverage profile:
  ▹   significantly under-covered bases written in lower cases,
  ▹   removal of significantly under-covered contigs,
  ▹   multi-allelic bases transformed into degenerated ones.

Usage

Run CoPro without option to read the following documentation:

 CoPro

 USAGE:      CoPro  -i <sam> -r <fasta> -o <name>  [-q <phr>]    [-Q <phr>]     [-s]
                                                   [-p <pval>] [-d] [-f <prop>] [-n]
                                                   [-v]         [-V]            [-h]
 OPTIONS:
   -i <infile>  read alignment  in SAM format;  set "-" to read  from standard input
                (mandatory)
   -r <infile>  genome sequence file  in FASTA format;  should contain  at least one
                sequence used for read alignments (mandatory)
   -o <string>  basename for output files (mandatory)
   -q <int>     minimum allowed base Phred score;  sequenced bases associated with a
                Phred score  smaller than  the specified  cutoff are  not considered
                (default: 13)
   -Q <int>     minimum allowed  mapping Phred score;  aligned reads associated with
                a Phred score smaller  than the specified cutoff  are not considered
                (default: 0)
   -s           allow secondary alignments (default: not allowed)
   -p <real>    p-value to  determine the  coverage depth  confidence interval  (CI)
                (default: 0.005 corresponding to 99% CI)
   -d           discard contigs with significantly low coverage depth
   -f <real>    minimum proportion to infer majority-rule character states (default:
                0.0 for no inference; otherwise at least 0.6666 is recommended)
   -n           only use N for degenerated majority-rule character states
   -v           verbose mode
   -V           print version and exit
   -h           print this help and exit

Notes

Method

The SAM-formatted read alignments are first parsed to count the number of sequenced bases aligned against every reference genome position. Of note, non-primary alignments are not considered by default. These counts are next transformed into the histogram of observed per-base coverage depth, the so-called genome coverage profile.

Let d(x) be a histogram on a finite number of bins. The area A(d) induced by d is simply the sum of the values d(x) with x varying among the different bins (i.e. the total number of observations). Let δ(x) be the unimodal histogram with the smallest possible area A(δ) such that δ(x) ≥ d(x) for every x (see figure below). By definition, one always have A(d) ≤ A(δ), with equality obtained only when d is monomodal. The ratio m := A(d) / A(δ) is then called the momo (monomodality) index; it quantifies whether d is unimodal (i.e. m = 1), almost unimodal (e.g. m > 0.999) or clearly multimodal (e.g. m < 0.99). The momo index of the observed coverage profile is then computed in order to assess its expected monomodality.

momo

It is worth noting that the momo index of any histogram d can be very easily computed using the following algorithm:

  • xmin = smallest bin of d ;
  • xmod = mode of d ;
  • xmax = highest bin of d ;
  • Ad = Aδ = 0 ;
  • δ = d(xmin) ;
  • for each x = xmin, xmin + 1, xmin + 2, …, xmod − 1
    do
      • Ad += d(x) ;
      • if d(x) > δ then δ = d(x) ; fi
      • Aδ += δ ;
    done
  • δ = d(xmax) ;
  • for each x = xmax, xmax − 1, xmax − 2, …, xmod
    do
      • Ad += d(x) ;
      • if d(x) > δ then δ = d(x) ; fi
      • Aδ += δ ;
    done
  • momo = Ad / Aδ ;

A generalized Poisson distribution GP(λθ) is next fitted to the observed coverage distribution by optimizing a weighted least-square criterion that deals with multimodal coverage profiles (if any). For each coverage depth x, the GP(λθ) probability mass function is given by

$$ f_x(\lambda, \theta) = \frac{\lambda (\lambda + x \theta)^{x-1}}{x! \text{ } e^{\lambda + x \theta}} $$ where λ > 0. This theoretical distribution is able to deal with under- and over-dispersed count data when θ < 0 and θ > 0, respectively, and is equivalent to the Poisson distribution when θ = 0. Its mean and variance are λ / (1 − θ) and λ / (1 − θ)3, respectively (Consul and Jain 1973, Consul and Shoukri 1985, 1988, Faroughi et al. 2025).

In order to quantify the level of fit between the observed coverage profile d(x) and the GP distribution A(dfx(λθ), a Q-Q plot is built to derive its coefficient of determination R2. Finally, given a specified p-value (option -p; default: 0.005), the coverage depth (1 − 2p) confidence interval (CI) is directly determined from the GP cumulative distribution function F, i.e. F(p) and F(1 − p).

In practice, when dealing with sequencing reads aligned against their genome assembly, a momo index close to 1 is expected, as multimodal coverage profiles are often the consequence of contaminated sequencing data (e.g. more than one isolates). In complement, when momo ≈ 1, the R2 fitting index is also expected to be close to 1; high momo together with low R2 are often caused by negatively skewed coverage profiles, indicative of some genome regions that are insufficiently covered by high-quality bases. Of note, these two quality control metrics are particularly expected to be close to 1 when the sequencing reads were prealably subjected to a (recommended) digital normalization procedure (e.g. Legrand et al. 2022).

In the output FASTA file, the average coverage depth (covr) and the F(p) lower bound (cutoff) are specified in each header. Every assembled base with observed coverage depth lower than F(p) is considered as significantly under-covered, and is written in lower case in the output FASTA file.

Finally, when option -f is set (i.e. f > 0), each multi-allelic position of the genome assembly can be corrected. For each position, the aligned bases are sorted according to the descending order of their proportion, and the set S of allele(s) is built until the cumulative proportion is higher than f. When S contains only one base, then the position is mono-allelic, e.g. A:0.9,T:0.1,C:0.0,G:0.0 leads to S = {A} with f = 2/3. Otherwise (i.e. |S| > 1), the position is multi-allelic and the degenerated nucleotide corresponding to the S content is written in the output files, e.g. A:0.64,T:0.26,G:0.08,C:0.02 leads to S = {A,T} with f = 2/3 (degenerated nucleotide W). Trivially, the expected number of corrected bases is larger when the value of f is close to 1.0; setting f = 2/3 is recommended to correct likely multi-allelic positions, but setting higher cutoff (e.g. f = 0.8) can be useful to obtain an enhanced genome assembly with only highly-confident assembled bases.

Example

The directory example/ contains the three following files:
•  hts.fastq, a FASTQ-formatted file containing short sequencing reads from an Escherichia phage phiX174,
•  phiX.fasta, a genome assembly of the phage phiX174 (5,386 bps),
•  hts.sam, a SAM-formatted file corresponding to the read alignments against the genome assembly.

CoPro can be run to compute the coverage profile of this dataset using the following command line:

CoPro -i hts.sam -r phiX.fasta -o out -f 0.6666 -v

As the option -v is set, the following output is displayed:

> Sequence(s):            phiX.fasta
+ phiX174  5386 bps
no. sequences             1
no. bases                 5386

> Read alignments:        hts.sam
no. selected reads        3028
no. mapped positions      5386
avg. coverage             45x

> Coverage profile:
mode                      43
momo                      0.9763
GP(lambda, theta)         (61.5612 , -0.369901)
R2                        0.9252
GP mean                   44.9
coverage 99.00% CI        [33 , 58]

> Output files:
+ coverage profile        out.txt
+ coverage details        out.tsv
+ genome sequence(s)      out.fasta

This shows that each of the 5,386 positions of the reference genome are mapped, leading to an average coverage depth of 45×. The momo index (0.9763) assesses that the coverage profile is not unimodal. However, the GP fitting (λ = 61.5612, θ = −0.369901) seems accurate (e.g. R2 = 0.9252) and leads to a (1 − 2p =)99% confidence interval (CI) of [33× , 58×] for the observed coverage depths. Therefore, any genome position with coverage depth < 33× can be considered as significantly under-covered.

These different statistics are summarized into out.txt for the observed distribution (i.e. the coverage profile, represented using character =) and the fitted theoretical one (represented using character *):

== observed coverage distribution
   no.pos  5386
   avg     44.98
   mode    43
   momo    0.97625521
** GP(lambda,theta) distribution
   lambda  61.56118278
   theta   -0.36990138
   mean    44.94
   R2      0.92519044
-- confidence interval (p-value=0.00500)
   low     33
   sup     58
         0                                                                               438
cov      |-------------------------------------------------------------------------------|
...
3        *                                                                                 1
4        *                                                                                 0
5        *                                                                                 0
6        *                                                                                 1
7        *                                                                                 0
8        *                                                                                 1
9        *=                                                                                3
10       *                                                                                 1
11       *                                                                                 0
12       *                                                                                 0
13       *                                                                                 0
14       *=                                                                                3
15       *                                                                                 2
16       *                                                                                 2
17       *                                                                                 1
18       *                                                                                 1
19       *=                                                                                8
20       *                                                                                 2
21       *==                                                                               9
22       *=                                                                                5
23       *=                                                                                7
24       *=                                                                                3
25       *=                                                                                3
26       *==                                                                               9
27       *=                                                                                3
28       *=                                                                                5
29       *=                                                                                3
30       =*                                                                                4
31       =*                                                                                5
32       ==*                                                                               7
33   low-==  *                                                                             4
34       ======*                                                                           29
35       ======    *                                                                       30
36       ========       *                                                                  39
37       ===================== *                                                           111
38       ==============================*                                                   158
39       ====================================== *                                          201
40       =================================================*                                265
41       ===========================================================*                      321
42       ===================================================================*==            380
43       ==========================================================================*=====  433
44       ===============================================================================*  430
45       =============================================================================== * 428
46       =====================================================================         *   370
47       =========================================================================*=       405
48       =================================================================*=======         396
49       ========================================================*=                        312
50       ===============================================*=                                 261
51       =====================================*                                            198
52       =========================   *                                                     132
53       ================     *                                                            82
54       ===============*                                                                  76
55       ==========*======                                                                 87
56       ======*===                                                                        47
57       ====*===                                                                          37
58   sup-==*===                                                                            26
59       =*=                                                                               10
60       =*==                                                                              14
61       *=                                                                                8
62       *=                                                                                5
63       *                                                                                 2
>63      =                                                                                 0

As the lower bound of the 99% CI is 33×, the 89 nucleotides associated with a coverage depth below this cutoff are assessed as significantly under-covered and are therefore written in lower case into out.fasta, i.e.

$>  egrep -v "#|>" out.tsv  |  egrep -c "a|c|g|t"
89

Of note, as the option -f was set to 0.6666, one multi-allelic position was rewritten into out.fasta, i.e.

$>  sed -n 1p out.tsv ; grep "1$" out.tsv
#     ref  seq  A    C    G    T    gap  rev  var
863   A    W    28   0    0    15   0    25   1

Indeed, at position 863, one can observe (28+15=)43 aligned reads (25 rev, and 43−25=18 fwd), leading to (28/43=)65% A and (15/43=)35% T.

References

Consul PC, Jain GC (1973) A generalization of the poisson distribution. Technometrics, 15(4):791:799. doi:10.1080/00401706.1973.10489112

Consul PC, Shoukri MM (1985) The generalized poisson distribution when the sample mean is larger than the sample variance. Communications in Statistics - Simulation and Computation, 14(3):667-681. doi:10.1080/03610918508812463

Consul PC, Shoukri MM (1988) Some Chance Mechanisms Related to a Generalized Poisson Probability Model. American Journal of Mathematical and Management Sciences, 8(1–2):181–202. doi:10.1080/01966324.1988.10737237

Faroughi P, Li S, Ren J (2025) Generalized Poisson random variable: its distributional properties and actuarial applications. Annals of Actuarial Science, 19:140-158. doi:10.1017/S1748499524000198

Legrand V, Kergrohen T, Joly N, Criscuolo A (2022) ROCK: digital normalization of whole genome sequencing data. Journal of Open Source Software, 7(73):3790. doi:10.21105/joss.03790

Lindner MS, Kollock M, Zickmann F, Renard BY (2013) Analyzing genome coverage profiles with applications to quality control in metagenomics. Bioinformatics, 29(10):1260-1267. doi:10.1093/bioinformatics/btt147