Published genomes suitable for whole genome comparison

The MUMmer package provides an efficient means of comparing one entire genome against another. Until 1999, however, no two published genomes were similar enough to compare. With the publication of the second strain of Helicobacter pylori in 1999, following the publication of the first strain in 1997, the scientific world had its first chance to look at two complete bacterial genomes whose DNA sequences were highly similar. The number of pairs of closely related genomes has exploded in recent years, facilitating many comparative studies. For instance, the published databases include the following organisms for which multiple strains and/or multiple species have been sequenced:

multiple strains of...
  • Agrobacterium tumefaciens
  • Bacillus anthracis
  • Brucella melitensis
  • Buchnera aphidicola
  • Chlamydophila pneumoniae
  • Escherichia coli
  • Helicobacter pylori
  • Mycobacterium tuberculosis
  • Neisseria meningitidis
  • Staphylococcus aureus
  • Streptococcus pyogenes
  • Streptococcus pneumoniae
  • Yersinia pestis
multiple species of...
  • Bacillus
  • Chlamydia
  • Clostridium
  • Corynebacterium
  • Lactobacillus
  • Listeria
  • Methanosarcina
  • Mycobacterium
  • Mycoplasma
  • Plasmodium
  • Pseudomonas
  • Pyrococcus
  • Rickettsia
  • Saccharomyces
  • Staphylococcus
  • Streptococcus
  • Thermoplasma
  • Vibrio
  • Xanthomonas
  • Xylella

Most of these genomes can be obtained from the NCBI ftp site: ftp://ftp.ncbi.nlm.nih.gov/genomes/

Timing analysis for the whole genome alignment of Human vs. Human

With the capability to align the entire human genome to itself, there is no genome too large for MUMmer. The following table gives run times and space requirements for a cross comparison of all human chromosomes. Column 1 gives the chromosome used as the reference, with "Un" referring to unmapped contigs. Column 2 shows the length of that chromosome, and column 3 the time to construct its suffix tree. Column 4 shows the total length of genomic DNA searched against the reference, and column 5 the time to stream that query sequence through the suffix tree. Column 6 shows the maximum amount of memory occupied by the program and data, and column 7 shows the memory usage of the suffix tree in bytes per base pair. Each human chromosome was used as a reference, and the rest of the genome was streamed against it as the query. To avoid duplication, a chromosome was included in the query only if it had not already been compared: we first used chromosome 1 as the reference and streamed the other 23 chromosomes against it, then used chromosome 2 as the reference and streamed chromosomes 3–22, X, and Y against it, and so on.
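
The reference/query scheme described above can be sketched in a few lines of Python. This is only an illustration of the pairing logic, not part of MUMmer: the names `SEQUENCES` and `cross_comparison_plan` are hypothetical, and the ordering (chromosomes 1 to 22, then Un, X, and Y) is an assumption.

```python
# Assumed ordering of the sequences: chromosomes 1-22, the unmapped
# contigs ("Un"), then X and Y. This ordering is an illustration only.
SEQUENCES = [str(n) for n in range(1, 23)] + ["Un", "X", "Y"]


def cross_comparison_plan(sequences):
    """Yield (reference, queries) so that each unordered pair is compared once.

    Each sequence serves as the reference in turn, and only the sequences
    that come after it (those not yet used as a reference) form the query.
    """
    for i, ref in enumerate(sequences[:-1]):
        yield ref, sequences[i + 1:]


plan = list(cross_comparison_plan(SEQUENCES))
for ref, queries in plan:
    print(f"reference {ref}: {len(queries)} query sequence(s), "
          f"{queries[0]} through {queries[-1]}")
```

Because every sequence is streamed only against references that precede it, no pair is aligned twice, and the query shrinks with each successive reference.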

Chr  Ref length  Suffix time  Qry length  Query time  Total space  Suffix space
     (Mbp)       (min)        (Mbp)       (min)       (MB)         (bytes/bp)
1    221.8       24.6         2617.1      679.5       3702         15.43
2    237.6       27.4         2379.5      625.8       3908         15.43
3    194.8       21.2         2184.7      565.0       3232         15.43
4    188.4       22.4         1996.3      518.0       3121         15.43
5    177.7       18.6         1818.6      461.4       2952         15.43
6    175.8       17.9         1642.8      407.6       2900         15.43
7    153.8       15.7         1489.0      360.1       2550         15.43
8    142.8       14.4         1346.2      322.3       2378         15.43
9    117.0       10.7         1229.2      303.7       1974         15.43
10   131.1       13.2         1098.1      263.3       2195         15.43
11   133.2       13.1         964.9       225.6       2228         15.43
12   129.4       12.5         835.5       195.9       2168         15.43
13   95.2        8.6          740.3       163.6       1633         15.44
14   88.2        7.5          652.1       141.0       1523         15.44
15   83.6        6.8          568.5       122.1       1451         15.44
16   80.9        6.4          487.6       106.3       1409         15.44
17   80.7        6.6          406.9       91.8        1406         15.44
18   74.6        6.3          332.3       78.8        1311         15.44
19   56.4        3.7          275.8       56.1        1026         15.45
20   59.4        4.6          216.4       45.8        1073         15.45
21   33.9        2.1          182.5       33.7        673          15.48
22   33.8        2.0          148.6       26.4        672          15.48
Un   1.4         0.03         147.3       10.0        164          16.96
X    147.3       14.6                     4.8         2327         15.57
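
As a rough check on how the columns relate, the suffix tree size can be estimated by multiplying the reference length by the bytes-per-base-pair figure. The sketch below does this for three rows copied from the table. It assumes that the total space column is in megabytes and that 1 MB is 10^6 bytes; the function name `suffix_tree_mb` is hypothetical.

```python
# Rows copied from the table:
# (chromosome, reference length in Mbp, total space, suffix space in bytes/bp)
ROWS = [
    ("1", 221.8, 3702, 15.43),
    ("21", 33.9, 673, 15.48),
    ("Un", 1.4, 164, 16.96),
]


def suffix_tree_mb(ref_mbp, bytes_per_bp):
    """Estimate suffix tree size in MB, assuming 1 MB = 10**6 bytes.

    ref_mbp * 1e6 bp * bytes_per_bp bytes/bp / 1e6 bytes/MB
    simplifies to ref_mbp * bytes_per_bp.
    """
    return ref_mbp * bytes_per_bp


for chrom, ref_mbp, total_mb, bytes_per_bp in ROWS:
    tree_mb = suffix_tree_mb(ref_mbp, bytes_per_bp)
    print(f"chr {chrom}: suffix tree ~{tree_mb:.1f} MB "
          f"of {total_mb} MB total")
```

Under these assumptions the estimated tree size stays below the reported total in each row, which fits the description of the total as covering the program and all of its data rather than the suffix tree alone.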

The human chromosomes can be obtained from the NCBI ftp site: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/