CoPro is a command line program written in Java to infer the coverage profile of a genome assembly (Lindner et al 2013). It analyses the sequencing reads aligned against their own assembly in order to assess whether the derived histogram over all observed per-base coverages is expectedly unimodal or not.
CoPro returns different results:
▹ average read
coverage per contig,
▹ coverage detail of every assembled
base,
▹ monomodality (momo) index of the genome
coverage profile,
▹ fitted parameters of the associated
generalized Poisson (GP) distribution,
▹ R2
fitting index,
▹ coverage depth confidence interval.
Of note, both momo and R2 indices can be used as quality control metrics to assess the overall accuracy of a genome assembly.
In complement, CoPro enhances the genome assembly from its
coverage profile:
▹ significantly under-covered bases written in
lower cases,
▹ removal of significantly under-covered
contigs,
▹ multi-allelic bases transformed into degenerated
ones.
Run CoPro without option to read the following documentation:
CoPro
USAGE: CoPro -i <sam> -r <fasta> -o <name> [-q <phr>] [-Q <phr>] [-s]
[-p <pval>] [-d] [-f <prop>] [-n]
[-v] [-V] [-h]
OPTIONS:
-i <infile> read alignment in SAM format; set "-" to read from standard input
(mandatory)
-r <infile> genome sequence file in FASTA format; should contain at least one
sequence used for read alignments (mandatory)
-o <string> basename for output files (mandatory)
-q <int> minimum allowed base Phred score; sequenced bases associated with a
Phred score smaller than the specified cutoff are not considered
(default: 13)
-Q <int> minimum allowed mapping Phred score; aligned reads associated with
a Phred score smaller than the specified cutoff are not considered
(default: 0)
-s allow secondary alignments (default: not allowed)
-p <real> p-value to determine the coverage depth confidence interval (CI)
(default: 0.005 corresponding to 99% CI)
-d discard contigs with significantly low coverage depth
-f <real> minimum proportion to infer majority-rule character states (default:
0.0 for no inference; otherwise at least 0.6666 is recommended)
-n only use N for degenerated majority-rule character states
-v verbose mode
-V print version and exit
-h print this help and exit
The aligned read file (option -i) should be in SAM
format, whereas the genome sequence against which the reads were aligned
(option -r) should be in FASTA format. It is worth noting
that CoPro does not deal with hard clipping (i.e. CIGAR
operator H in SAM-formatted alignments).
To process multiple files and/or avoid writting them, or to use
BAM-compressed data, SAM-formatted alignments can be directly read from
standard input by setting -i -,
e.g.
minimap2 -a ref.fasta hts.fastq | CoPro -i - -r ref.fasta -o out
cat alg1.sam alg2.sam alg3.sam | CoPro -i - -r ref.fasta -o out
samtools view -h alg.bam | CoPro -i - -r ref.fasta -o out
Coverage profile details are written into a .txt file (parameter
values and distributions). Coverage details of every position are
written into a .tsv file (positions, reference and returned bases,
observed numbers of A, C, G, T and gaps, number of reverse reads; the
last binary column var indicates variation between the
reference and the returned bases). Modified genome assembly (updated
FASTA headers, rewritten nucleotides, discarded contigs) is written into
a .fasta file; this enhanced genome assembly is strongly depending on
options -q, -Q, -s,
-p, -d, -f and
-n.
All aligned reads associated with a mapQ Phred score (i.e. col. 5
in SAM format) lower than the specified cutoff (option -Q;
default, 0) are not considered, as well as secondary alignments
(provided that option -s is not set). All sequenced bases
associated with a Phred score (col. 11 in SAM format) lower than the
specified cutoff (option -q; default, 13) are also not
considered.
For more details about the options -p and
-f, see the method summary below.
The SAM-formatted read alignments are first parsed to count the number of sequenced bases aligned against every reference genome position. Of note, non-primary alignments are not considered by default. These counts are next transformed into the histogram of observed per-base coverage depth, the so-called genome coverage profile.
Let d(x) be a histogram on a finite number of bins. The area A(d) induced by d is simply the sum of the values d(x) with x varying among the different bins (i.e. the total number of observations). Let δ(x) be the unimodal histogram with the smallest possible area A(δ) such that δ(x) ≥ d(x) for every x (see figure below). By definition, one always have A(d) ≤ A(δ), with equality obtained only when d is monomodal. The ratio m := A(d) / A(δ) is then called the momo (monomodality) index; it quantifies whether d is unimodal (i.e. m = 1), almost unimodal (e.g. m > 0.999) or clearly multimodal (e.g. m < 0.99). The momo index of the observed coverage profile is then computed in order to assess its expected monomodality.
It is worth noting that the momo index of any histogram d can be very easily computed using the following algorithm:
• xmin = smallest bin of d ;
•
xmod = mode of d ;
•
xmax = highest bin of d ;
•
Ad = Aδ = 0 ;
• δ
= d(xmin) ;
• for
each x = xmin,
xmin + 1, xmin + 2, …,
xmod − 1
do
•
Ad += d(x) ;
•
if d(x) > δ
then δ = d(x) ;
fi
• Aδ += δ
;
done
• δ =
d(xmax) ;
• for each
x = xmax, xmax − 1,
xmax − 2, …, xmod
do
• Ad +=
d(x) ;
• if
d(x) > δ then δ
= d(x) ; fi
•
Aδ += δ ;
done
• momo = Ad / Aδ
;
A generalized Poisson distribution GP(λ, θ) is next fitted to the observed coverage distribution by optimizing a weighted least-square criterion that deals with multimodal coverage profiles (if any). For each coverage depth x, the GP(λ, θ) probability mass function is given by
$$ f_x(\lambda, \theta) = \frac{\lambda (\lambda + x \theta)^{x-1}}{x! \text{ } e^{\lambda + x \theta}} $$ where λ > 0. This theoretical distribution is able to deal with under- and over-dispersed count data when θ < 0 and θ > 0, respectively, and is equivalent to the Poisson distribution when θ = 0. Its mean and variance are λ / (1 − θ) and λ / (1 − θ)3, respectively (Consul and Jain 1973, Consul and Shoukri 1985, 1988, Faroughi et al. 2025).
In order to quantify the level of fit between the observed coverage
profile d(x) and the GP distribution
A(d) fx(λ, θ), a
Q-Q plot is built to derive its coefficient of determination
R2. Finally, given a specified p-value
(option -p; default: 0.005), the coverage depth
(1 − 2p) confidence interval (CI) is directly determined from
the GP cumulative distribution function F,
i.e. F(p) and F(1 − p).
In practice, when dealing with sequencing reads aligned against their genome assembly, a momo index close to 1 is expected, as multimodal coverage profiles are often the consequence of contaminated sequencing data (e.g. more than one isolates). In complement, when momo ≈ 1, the R2 fitting index is also expected to be close to 1; high momo together with low R2 are often caused by negatively skewed coverage profiles, indicative of some genome regions that are insufficiently covered by high-quality bases. Of note, these two quality control metrics are particularly expected to be close to 1 when the sequencing reads were prealably subjected to a (recommended) digital normalization procedure (e.g. Legrand et al. 2022).
In the output FASTA file, the average coverage depth (covr) and the F(p) lower bound (cutoff) are specified in each header. Every assembled base with observed coverage depth lower than F(p) is considered as significantly under-covered, and is written in lower case in the output FASTA file.
Finally, when option -f is set (i.e. f > 0),
each multi-allelic position of the genome assembly can be corrected. For
each position, the aligned bases are sorted according to the descending
order of their proportion, and the set S of allele(s) is built
until the cumulative proportion is higher than f. When
S contains only one base, then the position is mono-allelic,
e.g. A:0.9,T:0.1,C:0.0,G:0.0 leads to S = {A} with
f = 2/3. Otherwise (i.e. |S| > 1), the position is
multi-allelic and the degenerated nucleotide corresponding to the
S content is written in the output files,
e.g. A:0.64,T:0.26,G:0.08,C:0.02 leads to S = {A,T} with
f = 2/3 (degenerated nucleotide W). Trivially, the expected
number of corrected bases is larger when the value of f is
close to 1.0; setting f = 2/3 is recommended to correct likely
multi-allelic positions, but setting higher cutoff
(e.g. f = 0.8) can be useful to obtain an enhanced genome
assembly with only highly-confident assembled bases.
The directory example/ contains the three following
files:
• hts.fastq, a FASTQ-formatted file containing
short sequencing reads from an Escherichia phage phiX174,
• phiX.fasta, a genome assembly of the phage phiX174 (5,386
bps),
• hts.sam, a SAM-formatted file corresponding to the
read alignments against the genome assembly.
CoPro can be run to compute the coverage profile of this dataset using the following command line:
CoPro -i hts.sam -r phiX.fasta -o out -f 0.6666 -vAs the option -v is set, the following output is
displayed:
> Sequence(s): phiX.fasta
+ phiX174 5386 bps
no. sequences 1
no. bases 5386
> Read alignments: hts.sam
no. selected reads 3028
no. mapped positions 5386
avg. coverage 45x
> Coverage profile:
mode 43
momo 0.9763
GP(lambda, theta) (61.5612 , -0.369901)
R2 0.9252
GP mean 44.9
coverage 99.00% CI [33 , 58]
> Output files:
+ coverage profile out.txt
+ coverage details out.tsv
+ genome sequence(s) out.fasta
This shows that each of the 5,386 positions of the reference genome are mapped, leading to an average coverage depth of 45×. The momo index (0.9763) assesses that the coverage profile is not unimodal. However, the GP fitting (λ = 61.5612, θ = −0.369901) seems accurate (e.g. R2 = 0.9252) and leads to a (1 − 2p =)99% confidence interval (CI) of [33× , 58×] for the observed coverage depths. Therefore, any genome position with coverage depth < 33× can be considered as significantly under-covered.
These different statistics are summarized into out.txt for
the observed distribution (i.e. the coverage profile, represented using
character =) and the fitted theoretical one (represented
using character *):
== observed coverage distribution
no.pos 5386
avg 44.98
mode 43
momo 0.97625521
** GP(lambda,theta) distribution
lambda 61.56118278
theta -0.36990138
mean 44.94
R2 0.92519044
-- confidence interval (p-value=0.00500)
low 33
sup 58
0 438
cov |-------------------------------------------------------------------------------|
...
3 * 1
4 * 0
5 * 0
6 * 1
7 * 0
8 * 1
9 *= 3
10 * 1
11 * 0
12 * 0
13 * 0
14 *= 3
15 * 2
16 * 2
17 * 1
18 * 1
19 *= 8
20 * 2
21 *== 9
22 *= 5
23 *= 7
24 *= 3
25 *= 3
26 *== 9
27 *= 3
28 *= 5
29 *= 3
30 =* 4
31 =* 5
32 ==* 7
33 low-== * 4
34 ======* 29
35 ====== * 30
36 ======== * 39
37 ===================== * 111
38 ==============================* 158
39 ====================================== * 201
40 =================================================* 265
41 ===========================================================* 321
42 ===================================================================*== 380
43 ==========================================================================*===== 433
44 ===============================================================================* 430
45 =============================================================================== * 428
46 ===================================================================== * 370
47 =========================================================================*= 405
48 =================================================================*======= 396
49 ========================================================*= 312
50 ===============================================*= 261
51 =====================================* 198
52 ========================= * 132
53 ================ * 82
54 ===============* 76
55 ==========*====== 87
56 ======*=== 47
57 ====*=== 37
58 sup-==*=== 26
59 =*= 10
60 =*== 14
61 *= 8
62 *= 5
63 * 2
>63 = 0
As the lower bound of the 99% CI is 33×, the 89 nucleotides associated with a coverage depth below this cutoff are assessed as significantly under-covered and are therefore written in lower case into out.fasta, i.e.
$> egrep -v "#|>" out.tsv | egrep -c "a|c|g|t"
89
Of note, as the option -f was set to 0.6666, one
multi-allelic position was rewritten into out.fasta, i.e.
$> sed -n 1p out.tsv ; grep "1$" out.tsv
# ref seq A C G T gap rev var
863 A W 28 0 0 15 0 25 1
Indeed, at position 863, one can observe (28+15=)43 aligned reads (25 rev, and 43−25=18 fwd), leading to (28/43=)65% A and (15/43=)35% T.
Consul PC, Jain GC (1973) A generalization of the poisson distribution. Technometrics, 15(4):791:799. doi:10.1080/00401706.1973.10489112
Consul PC, Shoukri MM (1985) The generalized poisson distribution when the sample mean is larger than the sample variance. Communications in Statistics - Simulation and Computation, 14(3):667-681. doi:10.1080/03610918508812463
Consul PC, Shoukri MM (1988) Some Chance Mechanisms Related to a Generalized Poisson Probability Model. American Journal of Mathematical and Management Sciences, 8(1–2):181–202. doi:10.1080/01966324.1988.10737237
Faroughi P, Li S, Ren J (2025) Generalized Poisson random variable: its distributional properties and actuarial applications. Annals of Actuarial Science, 19:140-158. doi:10.1017/S1748499524000198
Legrand V, Kergrohen T, Joly N, Criscuolo A (2022) ROCK: digital normalization of whole genome sequencing data. Journal of Open Source Software, 7(73):3790. doi:10.21105/joss.03790
Lindner MS, Kollock M, Zickmann F, Renard BY (2013) Analyzing genome coverage profiles with applications to quality control in metagenomics. Bioinformatics, 29(10):1260-1267. doi:10.1093/bioinformatics/btt147