Effective Genome Size
A number of tools can accept an “effective genome size”. This is defined as the length of the “mappable” genome. There are two common alternative ways to calculate this:
1. The number of non-N bases in the genome.
2. The number of regions (of some size) in the genome that are uniquely mappable (possibly given some maximal edit distance).
Option 1 can be computed using faCount from Kent’s tools. The effective genome size for a number of genomes using this method is given below:
| Genome | Effective size | 
|---|---|
| GRCh37 | 2864785220 | 
| GRCh38 | 2913022398 | 
| GRCm37 | 2620345972 | 
| GRCm38 | 2652783500 | 
| dm3 | 162367812 | 
| dm6 | 142573017 | 
| GRCz10 | 1369631918 | 
| WBcel235 | 100286401 | 
| TAIR10 | 119481543 | 
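To make option 1 concrete, the following is a minimal sketch of counting non-N bases in FASTA-formatted input. It is not how faCount is implemented. The function name `count_non_n_bases` and the toy sequences are assumptions made for illustration, and the sketch treats both `N` and `n` as masked bases.

```python
def count_non_n_bases(fasta_lines):
    """Count bases other than N/n across all sequences in FASTA-formatted lines."""
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        # Every character on a sequence line is a base; subtract the Ns.
        total += len(line) - line.count("N") - line.count("n")
    return total


# Toy input (hypothetical), not a real genome.
example = [">chr1", "ACGTNNNNACGT", "nnACGT", ">chr2", "NNNN", "GATTACA"]
print(count_non_n_bases(example))  # 19
```

Because the function only iterates over lines, an open file handle for a genome FASTA file can be passed in place of the list.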
The values from option 1 are only appropriate if multimapping reads are included. If they are excluded (or if any MAPQ filter is applied), then values derived from option 2 are more appropriate. These depend on the read length. We can approximate them for various read lengths using the khmer program, and unique-kmers.py in particular. A table of effective genome sizes for a given read length using this method is provided below:
| Read length | GRCh37 | GRCh38 | GRCm37 | GRCm38 | dm3 | dm6 | GRCz10 | WBcel235 | 
|---|---|---|---|---|---|---|---|---|
| 50 | 2685511504 | 2701495761 | 2304947926 | 2308125349 | 130428560 | 125464728 | 1195445591 | 95159452 | 
| 75 | 2736124973 | 2747877777 | 2404646224 | 2407883318 | 135004462 | 127324632 | 1251132686 | 96945445 | 
| 100 | 2776919808 | 2805636331 | 2462481010 | 2467481108 | 139647232 | 129789873 | 1280189044 | 98259998 | 
| 150 | 2827437033 | 2862010578 | 2489384235 | 2494787188 | 144307808 | 129941135 | 1312207169 | 98721253 | 
| 200 | 2855464000 | 2887553303 | 2513019276 | 2520869189 | 148524010 | 132509163 | 1321355241 | 98672758 |
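The idea behind the option 2 approximation can be sketched as counting the distinct k-mers in a genome, with k set to the read length. The sketch below does this exactly with a set, which is only practical for small inputs. It is an illustrative stand-in and not khmer's implementation: to our understanding khmer estimates this count probabilistically, so its numbers need not match an exact count. The function name `count_distinct_kmers`, the toy sequences, and the choice to skip k-mers containing N are assumptions.

```python
def count_distinct_kmers(sequences, k):
    """Exactly count distinct k-mers without N across the given sequences."""
    seen = set()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k over the sequence.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # assumption: windows overlapping N are ignored
                seen.add(kmer)
    return len(seen)


# Toy input (hypothetical), not a real genome.
toy = ["ACGTACGTAC", "NNACGTNN"]
print(count_distinct_kmers(toy, 4))  # 4
```

Memory use grows with the number of distinct k-mers, which is why an estimator is the practical choice for whole genomes.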