Effective Genome Size

A number of tools can accept an “effective genome size”. This is defined as the length of the “mappable” genome. There are two common alternative ways to calculate this:

1. The number of non-N bases in the genome.
2. The number of regions (of some size) in the genome that are uniquely mappable (possibly given some maximal edit distance).

Option 1 can be computed using faCount from Kent’s tools. The effective genome size for a number of genomes using this method is given below:

Genome      Effective size
GRCh37      2864785220
GRCh38      2913022398
GRCm37      2620345972
GRCm38      2652783500
dm3         162367812
dm6         142573017
GRCz10      1369631918
WBcel235    100286401
TAIR10      119481543
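To make option 1 concrete, here is a minimal sketch of counting non-N bases in a FASTA file. It is not faCount itself, only an illustration of the same idea; the function name `count_non_n_bases` and the toy sequence are assumptions made for this example.

```python
import io


def count_non_n_bases(fasta_handle):
    """Count bases other than N/n across all records of a FASTA stream."""
    total = 0
    for line in fasta_handle:
        line = line.strip()
        # Skip blank lines and record headers such as ">chr1".
        if not line or line.startswith(">"):
            continue
        # Soft-masked (lowercase) bases still count; only N/n are excluded.
        total += len(line) - line.count("N") - line.count("n")
    return total


# Toy input standing in for a real genome file (hypothetical data).
toy_fasta = io.StringIO(">chr1\nACGTNNNN\nacgtnn\n>chr2\nNNAC\n")
print(count_non_n_bases(toy_fasta))
```

For a real genome, the same function could be applied to an open file handle, for example `with open("genome.fa") as fh: count_non_n_bases(fh)`, where `genome.fa` is a placeholder path.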

The option 1 values are only appropriate if multimapping reads are included. If they are excluded (or any MAPQ filter is applied), then values derived from option 2 are more appropriate, and these depend on the read length. We can approximate these values for various read lengths using the khmer program, in particular its unique-kmers.py script. A table of effective genome sizes for a given read length, computed with this method, is provided below:

Read length  GRCh37      GRCh38      GRCm37      GRCm38      dm3        dm6        GRCz10      WBcel235
50           2685511504  2701495761  2304947926  2308125349  130428560  125464728  1195445591  95159452
75           2736124973  2747877777  2404646224  2407883318  135004462  127324632  1251132686  96945445
100          2776919808  2805636331  2462481010  2467481108  139647232  129789873  1280189044  98259998
150          2827437033  2862010578  2489384235  2494787188  144307808  129941135  1312207169  98721253
200          2855464000  2887553303  2513019276  2520869189  148524010  132509163  1321355241  98672758
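The idea behind the read-length-dependent values can be illustrated by counting the distinct k-mers in a sequence, with k set to the read length. The sketch below is an exact, set-based count for toy inputs only. It is not khmer's implementation, which (to my understanding) estimates this number probabilistically so that it scales to whole genomes. The function name `count_distinct_kmers`, the choice to skip k-mers containing N, and the forward-strand-only counting are assumptions made for this example.

```python
def count_distinct_kmers(sequence, k):
    """Return the number of distinct k-mers in `sequence` that contain no N."""
    seq = sequence.upper()
    seen = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: windows overlapping an N are not mappable, so skip them.
        if "N" in kmer:
            continue
        seen.add(kmer)
    return len(seen)


# Toy sequence (hypothetical data); a real genome needs a memory-efficient estimator.
toy_sequence = "ACGTACGTNNACGA"
for k in (3, 4):
    print(k, count_distinct_kmers(toy_sequence, k))
```

Holding every k-mer in a Python set is only feasible for short sequences, which is why an approximate counter is the practical choice at genome scale.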
