Effective Genome Size

A number of tools can accept an “effective genome size”. This is defined as the length of the “mappable” genome. There are two common alternative ways to calculate this:

1. The number of non-N bases in the genome.
2. The number of regions (of some size) in the genome that are uniquely mappable (possibly given some maximal edit distance).

Option 1 can be computed using faCount from Kent’s tools. The effective genome size for a number of genomes using this method is given below:

Genome	Effective size
GRCh37	2864785220
GRCh38	2913022398
GRCm37	2620345972
GRCm38	2652783500
dm3	162367812
dm6	142573017
GRCz10	1369631918
WBcel235	100286401
TAIR10	119481543
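
To make option 1 concrete, here is a minimal Python sketch of the same idea: count every base that is not N. It is not faCount itself; the function name `count_non_n_bases` and the toy sequences are made up for illustration, and lowercase (soft-masked) `n` is assumed to count as N.

```python
def count_non_n_bases(fasta_lines):
    """Count bases other than N/n across all sequences in FASTA-formatted lines."""
    total = 0
    for line in fasta_lines:
        line = line.strip()
        # Skip blank lines and sequence headers.
        if not line or line.startswith(">"):
            continue
        total += len(line) - line.count("N") - line.count("n")
    return total


# Toy input (hypothetical); a real genome would be read from a FASTA file.
example = [">chr1", "ACGTNNNNacgt", "NNACGT", ">chr2", "nnnnACGTAC"]
print(count_non_n_bases(example))  # prints 18
```

An open file handle can be passed directly, since the function only iterates over lines.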

These values are only appropriate if multimapping reads are included. If they are excluded (or any MAPQ filter is applied), then values derived from option 2 are more appropriate. These values depend on the read length. We can approximate them for various read lengths using the khmer program, and unique-kmers.py in particular. A table of effective genome sizes for a given read length using this method is provided below:

Read length	GRCh37	GRCh38	GRCm37	GRCm38	dm3	dm6	GRCz10	WBcel235
50	2685511504	2701495761	2304947926	2308125349	130428560	125464728	1195445591	95159452
75	2736124973	2747877777	2404646224	2407883318	135004462	127324632	1251132686	96945445
100	2776919808	2805636331	2462481010	2467481108	139647232	129789873	1280189044	98259998
150	2827437033	2862010578	2489384235	2494787188	144307808	129941135	1312207169	98721253
200	2855464000	2887553303	2513019276	2520869189	148524010	132509163	1321355241	98672758
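
The idea behind the read-length-dependent values can be illustrated by counting the distinct k-mers in a sequence, with k set to the read length. The sketch below is an exact, set-based toy and not what unique-kmers.py does internally; the function name `count_distinct_kmers` and the toy sequences are hypothetical. It assumes that k-mers containing N are skipped and that only the forward strand is considered. Holding every k-mer in memory is only feasible for small inputs.

```python
def count_distinct_kmers(sequences, k):
    """Return the number of distinct k-mers (without N) across the sequences."""
    seen = set()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" in kmer:
                continue  # assumption: windows overlapping N are not mappable
            seen.add(kmer)
    return len(seen)


# Toy input (hypothetical); k would normally be the read length, e.g. 50.
toy = ["ACGTACGTAC", "NNACGTNN"]
print(count_distinct_kmers(toy, 4))  # prints 4
```

Repeated k-mers are counted once, which is why the result shrinks relative to the non-N base count for repetitive sequence.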
