GPLv3 license JAVA

CoPro

CoPro is a command line program written in Java to infer the coverage profile of a genome assembly (Lindner et al 2013). It analyses the sequencing reads aligned against their own assembly in order to assess whether the derived histogram over all observed per-base coverages is expectedly unimodal or not.

CoPro returns different results:
  ▹   average read coverage per contig,
  ▹   coverage detail of every assembled base,
  ▹   Fraction of the genome Under several coverage cutoffs (FU10, FU20, FU30, FU40, FU50),
  ▹   monomodality (momo) index of the genome coverage profile (GCP),
  ▹   fitted parameters of the associated generalized Poisson (GP) distribution,
  ▹   Bhattacharyya coefficient (BC) to asssess the fit between the GCP and the fitted GP distribution,
  ▹   coverage depth confidence interval.

Of note, both momo and BC indices can be used as quality control metrics to assess the overall coverage depth sufficiency of a whole genome sequencing data.

In complement, CoPro enhances the genome assembly from its coverage profile:
  ▹   significantly under-covered bases written in lower cases,
  ▹   removal of significantly under-covered contigs,
  ▹   multi-allelic bases transformed into degenerated ones.

Usage

Run CoPro without option to read the following documentation:

 USAGE:      CoPro  -i <sam> -r <fasta> -o <name>  [-q <phr>]    [-Q <phr>]  [-s]
                                                   [-p <pval>]   [-d]
                                                   [-c <cutoff>] [-f <prop>] [-n]
                                                   [-v]          [-V]        [-h]
 OPTIONS:
   -i <infile>  read alignment  in SAM format;  set "-" to read  from standard input
                (mandatory)
   -r <infile>  genome sequence file  in FASTA format;  should contain  at least one
                sequence used for read alignments (mandatory)
   -o <string>  basename for output files (mandatory)
   -q <int>     minimum allowed base Phred score;  sequenced bases associated with a
                Phred score  smaller than  the specified  cutoff are  not considered
                (default: 13)
   -Q <int>     minimum allowed  mapping Phred score;  aligned reads associated with
                a Phred score smaller  than the specified cutoff  are not considered
                (default: 0)
   -s           allow secondary alignments (default: not allowed)
   -p <real>    p-value to  determine the  coverage depth  confidence interval  (CI)
                (default: 0.005 corresponding to 99% CI)
   -d           discard contigs with significantly low coverage depth
   -c <real>    ratio cutoff to assess multi-allelic positions  (default: 1.0 for no
                assessment; otherwise 0.6 is recommended)
   -f <real>    minimum proportion to infer majority-rule character states at multi-
                allelic positions (default: 0.8)
   -n           only use N for degenerated majority-rule character states
   -v           verbose mode
   -V           print version and exit
   -h           print this help and exit

Notes

Methods

Genome coverage profile

The SAM-formatted read alignments are first parsed to count the number of sequenced bases aligned against every reference genome position. Of note, non-primary alignments are not considered by default. These counts are next transformed into the histogram of observed per-base coverage depth, the so-called genome coverage profile.

FU indices

The FU10, FU20, FU30, FU40 and FU50 indices are the fractions of the genome positions for which the coverage depth is under 10×, 20×, 30×, 40× and 50×, respectively. They enable to quickly check whether a genome assembly is locally under-covered, or not. In practice, a genome assembly with FU50 > 15% is often too fragmented, suggesting a new sequencing to improve the overall coverage depth.

Mono-modality index (momo)

Let d(x) be a histogram on a finite number of bins. The area A(d) induced by d is simply the sum of the values d(x) with x varying among the different bins (i.e. the total number of observations). Let δ(x) be the unimodal histogram with the smallest possible area A(δ) such that δ(x) ≥ d(x) for every x (see figure below). By definition, one always have A(d) ≤ A(δ), with equality obtained only when d is monomodal. The ratio m := A(d) / A(δ) is then called the momo (monomodality) index; it quantifies whether d is unimodal (i.e. m = 1), almost unimodal (e.g. m > 0.999) or clearly multimodal (e.g. m < 0.99). The momo index of the observed coverage profile is then computed in order to assess its expected monomodality.

momo

It is worth noting that the momo index of any histogram d can be very easily computed using the following algorithm:

  • xmin = smallest bin of d ;
  • xmod = mode of d ;
  • xmax = highest bin of d ;
  • Ad = Aδ = 0 ;
  • δ = d(xmin) ;
  • for each x = xmin, xmin + 1, xmin + 2, …, xmod − 1
    do
      • Ad += d(x) ;
      • if d(x) > δ then δ = d(x) ; fi
      • Aδ += δ ;
    done
  • δ = d(xmax) ;
  • for each x = xmax, xmax − 1, xmax − 2, …, xmod
    do
      • Ad += d(x) ;
      • if d(x) > δ then δ = d(x) ; fi
      • Aδ += δ ;
    done
  • momo = Ad / Aδ ;

In practice, when dealing with sequencing reads aligned against their (haploid) genome assembly, a momo index close to 1 is expected, as multimodal coverage profiles are often the consequence of contaminated sequencing data (e.g. more than one isolates). A genome coverage profile with momo < 0.99 is generally indicative of a contaminated sample when dealing with prokaryote genomes.

Generalized Poisson distribution (GP)

A generalized Poisson distribution GP(λθ) is next fitted to the observed coverage distribution by optimizing a weighted least-square criterion that deals with multimodal coverage profiles (if any). For each coverage depth x, the GP(λθ) probability mass function is given by

$$ f_x(\lambda, \theta) = \frac{\lambda (\lambda + x \theta)^{x-1}}{x! \text{ } e^{\lambda + x \theta}} $$ where λ > 0. This theoretical distribution is able to deal with under- and over-dispersed count data when θ < 0 and θ > 0, respectively, and is equivalent to the Poisson distribution when θ = 0. Its mean and variance are λ / (1 − θ) and λ / (1 − θ)3, respectively (Consul and Jain 1973, Consul and Shoukri 1985, 1988, Faroughi et al. 2025).

Bhattacharyya coefficient (BC)

In order to quantify the level of fit between the observed coverage profile d(x) and the GP distribution A(dfx(λθ), the Bhattacharyya (1942, 1943, 1946) coefficient is estimated using the following formula:

$$ BC = \sum_x \sqrt{f_x(\lambda, \theta) d(x) / A(d)} $$ When momo ≈ 1, the BC is also expected to be close to 1; high momo together with low BC are often caused by negatively skewed coverage profiles, indicative of some genome regions that are insufficiently covered by high-quality bases (also indicated by large FU indices when the overall coverage depth is expected to be > 50×). Of note, both momo and BC quality control metrics are particularly expected to be close to 1 when the sequencing reads were prealably subjected to a (recommended) digital normalization procedure (e.g. Legrand et al. 2022).

Coverage depth confidence interval

Given a specified p-value (option -p; default: 0.005), the coverage depth (1 − 2p) confidence interval (CI) is directly determined from the GP cumulative distribution function F, i.e. F(p) and F(1 − p). In the output FASTA file, the average coverage depth (covr) and the F(p) lower bound (cutoff) are specified in each header. Every assembled base with observed coverage depth lower than F(p) is considered as significantly under-covered, and is written in lower case in the output FASTA file. In general, it is expected that the genome fraction corresponding to these under-covered position remains quite low (e.g. <3%).

Multi-allelic positions

For each genome position, the proportions of A, C, G and T within the covering reads are sorted in descending order and denoted f1 ≥ f2 ≥ f3 ≥ f4. In most cases, it is expected that f1 ≈ 1.00 and f2 ≈ f3 ≈ f4 ≈ 0.00, leading to a non-ambiguous mono-allelic position; however, when f1 decreases (e.g. 0.7 > f1 ≥ f2 > 0.3), the probability that the position is multi-allelic can become non-negligible. For a given proportion 1.00 ≥ f1 ≥ 0.25 of the most occurring base, it can be easily shown that the proportion f2 of the second most occurring base is bounded by f1 (see the gray triangle in the following figure). In consequence, the multi-allelic status of a given position can be assessed by the ratio f2 / f1 (i.e. the slope of the green line).

f1.f2

Indeed, when f2 / f1 is low, the position is generally considered as mono-allelic in practice, but the multi-allelic status becomes likely when this ratio becomes high (see examples in the table below).

(f1,f2,f3,f4) f2 / f1
(0.90, 0.10, 0.00, 0.00) 0.10 / 0.90 ≈ 0.11
(0.80, 0.07, 0.07, 0.06) 0.07 / 0.80 ≈ 0.09
(0.80, 0.20, 0.10, 0.10) 0.20 / 0.80 = 0.25
(0.70, 0.10, 0.10, 0.10) 0.10 / 0.70 ≈ 0.14
(0.70, 0.15, 0.15, 0.00) 0.15 / 0.70 ≈ 0.21
(0.70, 0.30, 0.29, 0.01) 0.30 / 0.70 ≈ 0.43
(0.60, 0.14, 0.13, 0.13) 0.14 / 0.60 ≈ 0.23
(0.60, 0.26, 0.24, 0.00) 0.26 / 0.60 ≈ 0.43
(0.60, 0.40, 0.00, 0.00) 0.40 / 0.60 ≈ 0.66

A genome position can then be assessed as mono-allelic when the ratio f2 / f1 is lower of equal than a fixed cutoff c (option -c). Varying this cutoff enables to be very tolerant (e.g. c ≈ 1.0) or very conservative (e.g. c ≈ 0.5) in determining the allowed domain of f2 (see figure below).

f1.f2.c

In practice, is is recommended to set c = 0.6, but slightly different values can be preferred depending on the user’s needs. As shown in the figure below, A:0.55,T:0.30,G:0.10,C:0.05 can be assessed as a mono- or a multi-allelic position when setting c = 0.6 or c = 0.5, respectively, whereas A:0.55,T:0.35,G:0.10,C:0.00 can be assessed as a mono- or a multi-allelic position when setting c = 0.7 or c = 0.6, respectively

f1.f2.pix

Finally, each multi-allelic position of the genome assembly is corrected by setting the degenerated nucleotide corresponding to the set S of distinct alleles. The set S is built with the different most occurring alleles until the cumulative proportion is higher than a fixed threshold (option -f; default: 0.8) (see examples below).

(A, C, G, T) S degenerated
nucleotide
(0.60, 0.02, 0.00, 0.38) {A,T}
(0.60+0.38 = 0.98 > 0.8)
W
(0.45, 0.00, 0.55, 0.00) {G,A}
(0.55+0.45 = 1.00 > 0.8)
R
(0.10, 0.35, 0.30, 0.25) {C,G,T}
(0.35+0.30+0.25 = 0.90 > 0.8)
B

Of note, every degenerated nucleotide can be replaced by a simple N when setting the option -n.

Example

The directory example/ contains the three following files:
•  hts.fastq, a FASTQ-formatted file containing short sequencing reads from an Escherichia phage phiX174,
•  phiX.fasta, a genome assembly of the phage phiX174 (5,386 bps),
•  hts.sam, a SAM-formatted file corresponding to the read alignments against the genome assembly.

CoPro can be run to compute the coverage profile of this dataset using the following command line:

CoPro -i hts.sam -r phiX.fasta -o out -c 0.5 -v

As the option -v is set, the following output is displayed:

> Sequence(s):            phiX.fasta
+ phiX174  5386 bps
no. sequences             1
no. bases                 5386

> Read alignments:        hts.sam
no. selected reads        3028
no. mapped positions      5386
avg. coverage             45x

> Coverage profile:
mode                      43
momo                      0.9763
FU10                      0.11%
FU20                      0.45%
FU30                      1.36%
FU40                      12.27%
FU50                      81.71%
GP(lambda, theta)         (61.8225 , -0.380322)
BC                        0.9897
GP mean                   44.8
coverage 99.00% CI        [32 , 57]
FU32                      1.52%

> Output files:
+ coverage profile        out.txt
+ coverage details        out.tsv
+ genome sequence(s)      out.fasta

This shows that each of the 5,386 positions of the reference genome are mapped, leading to an average coverage depth of 45×. The momo index (0.9763) assesses that the coverage profile is not unimodal. However, the GP fitting (λ = 61.8225, θ = −0.380322) seems accurate (e.g. BC = 0.9897) and leads to a (1 − 2p =)99% confidence interval (CI) of [32× , 57×] for the observed coverage depths. Therefore, any genome position with coverage depth < 32× can be considered as significantly under-covered (i.e. FU32 = 1.52% of the reference genome).

These different statistics are summarized into out.txt for the observed distribution (i.e. the coverage profile, represented using character =) and the fitted theoretical one (represented using character *):

== observed coverage distribution
   no.pos  5386
   avg     44.98
   mode    43
   momo    0.97625521
   FU10    0.11%
   FU20    0.45%
   FU30    1.36%
   FU40    12.27%
   FU50    81.71%
** GP(lambda,theta) distribution
   lambda  61.82249692
   theta   -0.38032174
   mean    44.79
   BC      0.98968153
-- confidence interval (p-value=0.00500)
   low     32
   sup     57
   FU32    1.52%
         0                                                                               442
cov      |-------------------------------------------------------------------------------|
...
3        *                                                                                 1
4        *                                                                                 0
5        *                                                                                 0
6        *                                                                                 1
7        *                                                                                 0
8        *                                                                                 1
9        *=                                                                                3
10       *                                                                                 1
11       *                                                                                 0
12       *                                                                                 0
13       *                                                                                 0
14       *=                                                                                3
15       *                                                                                 2
16       *                                                                                 2
17       *                                                                                 1
18       *                                                                                 1
19       *=                                                                                8
20       *                                                                                 2
21       *==                                                                               9
22       *=                                                                                5
23       *=                                                                                7
24       *=                                                                                3
25       *=                                                                                3
26       *==                                                                               9
27       *=                                                                                3
28       *=                                                                                5
29       *=                                                                                3
30       =*                                                                                4
31       =*                                                                                5
32   low-==*                                                                               7
33       ==  *                                                                             4
34       ====== *                                                                          29
35       ======    *                                                                       30
36       ========        *                                                                 39
37       ===================== *                                                           111
38       ==============================*                                                   158
39       =====================================   *                                         201
40       ================================================= *                               265
41       =========================================================== *                     321
42       ====================================================================*=            380
43       ===========================================================================*===   433
44       ===============================================================================*  430
45       =============================================================================== * 428
46       ====================================================================         *    370
47       ========================================================================*=        405
48       ================================================================*========         396
49       =======================================================*==                        312
50       =============================================*==                                  261
51       ===================================*=                                             198
52       =========================  *                                                      132
53       ================   *                                                              82
54       =============*=                                                                   76
55       =========*=======                                                                 87
56       ======*===                                                                        47
57   sup-===*====                                                                          37
58       ==*===                                                                            26
59       =*=                                                                               10
60       =*==                                                                              14
61       *=                                                                                8
62       *=                                                                                5
63       *                                                                                 2
>63      =                                                                                 0

As the lower bound of the 99% CI is 32×, the 82 nucleotides associated with a coverage depth below this cutoff are assessed as significantly under-covered and are therefore written in lower case into out.fasta, i.e.

$>  egrep -v "#|>" out.tsv  |  egrep -c "a|c|g|t"
82

Of note, as the option -c was set to 0.5, one multi-allelic position was rewritten into out.fasta, i.e.

$>  sed -n 1p out.tsv ; grep "1$" out.tsv
#     ref  seq  A    C    G    T    gap  rev  var
863   A    W    28   0    0    15   0    25   1

Indeed, at position 863, one can observe (28+15=)43 aligned reads (25 rev, and 43−25=18 fwd), leading to (28/43≈)65.1% A and (15/43≈)34.8% T. As 15/28≈0.53 is greater than the cutoff 0.5, the position was corrected by setting the degenerated nucleotide W (corresponding to S = {A,T}).

References

Bhattacharyya A (1942) On discrimination and divergence. Proceedings of the Twenty-ninth Indian Science Congress. Asiatic Society of Bengal. Part III, Section I: Mathematics & Statistics, p. 13. [link]

Bhattacharyya A (1943) On a Measure of Divergence between Two Statistical Populations Defined by Their Probability Distributions. Bulletin of the Calcutta Mathematical Society, 35:99-109. [link]

Bhattacharyya A (1946) On a Measure of Divergence between Two Multinomial Populations. Sankhyā: The Indian Journal of Statistics, 7(4):401-406. [link]

Consul PC, Jain GC (1973) A generalization of the poisson distribution. Technometrics, 15(4):791:799. doi:10.1080/00401706.1973.10489112

Consul PC, Shoukri MM (1985) The generalized poisson distribution when the sample mean is larger than the sample variance. Communications in Statistics - Simulation and Computation, 14(3):667-681. doi:10.1080/03610918508812463

Consul PC, Shoukri MM (1988) Some Chance Mechanisms Related to a Generalized Poisson Probability Model. American Journal of Mathematical and Management Sciences, 8(1–2):181–202. doi:10.1080/01966324.1988.10737237

Faroughi P, Li S, Ren J (2025) Generalized Poisson random variable: its distributional properties and actuarial applications. Annals of Actuarial Science, 19:140-158. doi:10.1017/S1748499524000198

Legrand V, Kergrohen T, Joly N, Criscuolo A (2022) ROCK: digital normalization of whole genome sequencing data. Journal of Open Source Software, 7(73):3790. doi:10.21105/joss.03790

Lindner MS, Kollock M, Zickmann F, Renard BY (2013) Analyzing genome coverage profiles with applications to quality control in metagenomics. Bioinformatics, 29(10):1260-1267. doi:10.1093/bioinformatics/btt147