CAFE

Software for Computational Analysis of gene Family Evolution

The purpose of CAFE is to analyze changes in gene family size in a way that accounts for phylogenetic history and provides a statistical foundation for evolutionary inferences. The program uses a birth and death process to model gene gain and loss across a user-specified phylogenetic tree. The distribution of family sizes generated under this model can provide a basis for assessing the significance of the observed family size differences among taxa.

This repository contains code for CAFE 5, an updated version of the CAFE code base. CAFE 5 showcases new functionalities while keeping or improving several of the features available in prior releases.

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Build Status

History

The original development of the statistical framework and algorithms are described by Hahn, et al. (2005) and later implemented in the software package CAFE by De Bie, et al. (2006).

CAFE v2.0 (Hahn, Demuth, et al. 2007; Hahn, Han, et al. 2007) Included software updates and functionality that allowed users to specify different λ values for different branches on their input tree.

CAFE v3.0 (Han, et al. 2013) was a major update to CAFE 2.0, with added functionality that included: 1) the ability to correct for genome assembly and annotation error. 2) The ability to estimate separate birth (λ) and death (μ) rates using the lambdamu command. 3) The ability to estimate error in an input data set with iterative use of the errormodel command using the accompanying python script caferror.py. This version also included the addition of the rootdist command to give the user more control over simulations.

CAFE v4.0 was primarily a maintenance update. At that time formal issue tracking and a user forum was introduced.

CAFE v5.0. (Current Release) Another major update, CAFE5 showcases new functionalities while keeping or improving several of the features available in prior releases.

How to Cite

If you use CAFE5 in your work, please cite the application as

Original development of the statistical framework and algorithms implemented in CAFE are published in:

The citation for CAFE v2.0 is:

The citation for CAFE v3.1 and v4.0 is:

New Functionality

What CAFE5 does

CAFE5 implements a birth-death model for evolutionary inferences about gene family evolution. Its main task is the maximum-likelihood estimation of a global or local gene family evolutionary rates (lambda parameter) for a given data set. Briefly, CAFE5 can:

What CAFE5 does NOT do

Running CAFE5

Quick Start

For a typical CAFE analysis, users are most interested in determining two things: 1) Which gene families are rapidly evolving 2) The branches of the tree on which these families are rapidly evolving

This type of analysis requires a minimum of two input files: 1) The count data file is a tab-delimited family “counts” file that contains a column for a description of the gene family, the unique ID for each family, and a column for each taxon that has count data for each family. This file is acquired by first peforming a clustering analysis, often using software such as OrthoMCL, SwiftOrtho, FastOrtho, OrthAgogue, or OrthoFinder and then parsing the output into a table like the one below (Note: if a functional description is not desired, include this column anyway with a place holder as below (null)).

Example: mammal_gene_families.txt

Desc    Family ID   human   chimp   orang   baboon  gibbon  macaque marmoset rat    mouse   cat horse   cow
ATPase  ORTHOMCL1    52  55  54  57  54   56      56     53  52 57  55   54
(null)  ORTHOMCL2    76  51  41  39  45   36      37     67  79 37  41   49
HMG box ORTHOMCL3    50  49  48  48  46   49      48     55  52 51  47   55
(null)  ORTHOMCL4    43  43  47  53  44   47      46     59  58 51  50   55
Dynamin ORTHOMCL5    43  40  43  44  31   46      33     79  70 43  49   50
......
....
..
DnaJ    ORTHOMCL10016    45  46  50  46  46       47      46     48  49 45  44   48
  1. The tree file should contain a binary, rooted, ultrametric, tree in Newick format. Typically one obtains this tree using one of several molecular dating methods. If you are unsure if your tree is binary, rooted, or ultrametric CAFE will report this when you try to use it for an analysis. Alternatively, you can use the R package, Ape with its included functions: is.ultrametric, is.rooted, and is.binary.

Example: mammals_tree.txt

((((cat:68.710507,horse:68.710507):4.566782,cow:73.277289):20.722711,(((((chimp:4.444172,human:4.444172):6.682678,orang:11.126850):2.285855,gibbon:13.412706):7.211527,(macaque:4.567240,baboon:4.567240):16.056992):16.060702,marmoset:36.684935):57.315065):38.738021,(rat:36.302445,mouse:36.302445):96.435575);

To get a list of commands just call CAFE with the -h or –help arguments:

$ cafe5 -h

To estimate lambda with no among family rate variation issue the command:

$ cafe5 -i mammal_gene_families.txt -t mammal_tree.txt

To incorporate among family rate variation with both lambda and alpha estimated and three discrete gamma rate categories:

$ cafe5 -i mammal_gene_families.txt -t mammal_tree.txt -k 3

To estimate separate lambda values for different lineages in the tree, first identify the branches to which each lambda will apply. This can be done by making a copy of your tree, and substituting the lambda identifier (1,2,3, etc.) for the branch length values. For example, to apply a different lambda to the branches leading to human, chimp, and their ancestor, modify the branches as below.

Example chimphuman_separate_lambda.txt:
((((cat:1,horse:1):1,cow:1):1,(((((chimp:2,human:2):2,orang:1):1,gibbon:1):1,(macaque:1,baboon:1):1):1,marmoset:1):1):1,(rat:1,mouse:1):1);

For this tree, lambda #2 will be applied to branches leading to human, chimp, and their ancestor while lambda #1 will be applied to all other branches of the tree.

To run this analysis with both lambdas estimated:

$ cafe5 -i mammal_gene_families.txt -t mammal_tree.txt -y chimphuman_separate_lambda.txt 

Caveats

Tutorial

A tutorial is provided in the tutorial directory. It provides instructions on how to generate a reasonable gene family groups in the correct format, dated ultrametric trees, and basic CAFE analyses. The tutorial contains tutorial.md and some helper scripts.

### Slow Start
CAFE5 performs three different operations on either one or two models. The operations are
- Estimate Lambda - the traditional function of CAFE. Takes a tree and a file of gene family counts, and performs a maximum likelihood calculation to estimate the most likely rate of change across the entire tree.
- Simulate - Given specified values, generate an artificial list of gene families that matches the values. To generate a simulation, pass the –simulate or -s parameter. Either pass a count of families to be simulated with the parameter (–simulate=1000) or pass a –rootdist (-f) parameter with a file containing the distribution to match (see [rootdist] for the file format).
The models are - Base - Perform computations as if no gamma function is available
- Gamma - Perform computations as if each gene family can belong to a different evolutionary rate category. To use Gamma modelling, pass the -k parameter specifying the number of categories to use.
Unlike earlier versions, CAFE5 does not require a script. All options are given at once on the command line. Here is an example:
cafe5 -t examples/mammals_tree.txt -i examples/mammal_gene_families.txt -p -k 3
In this example, the -t parameter specifies a file containing the tree that CAFE uses; and the -i parameter specifies a list of gene families. The -p, in this instance given without a parameter, indicates that the root equilibrium frequency will not be a uniform distribution. The -k parameter specifies how many gamma rate categories to use.
Parameters

Input files

Output

All output will be stored to the “results” directory, unless another directory is specified with the “-o” parameter.

Examples

Search for a single lambda value using the mammal phylogeny and gene family set in the Examples directory:

../bin/cafe5 -t mammals_tree.txt -i mammal_gene_families.txt -p -o singlelambda

In earlier versions, the following script would have returned the same values:

tree ((((cat:68.710507,horse:68.710507):4.566782,cow:73.277289):20.722711,(((((chimp:4.444172,human:4.444172):6.682678,orang:11.126850):2.285855,gibbon:13.412706):7.211527,(macaque:4.567240,baboon:4.567240):16.056992):16.060702,marmoset:36.684935):57.315065):38.738021,(rat:36.302445,mouse:36.302445):96.435575)
load -i filtered_cafe_input.txt -t 10 -l reports/run6_caferror_files/cafe_log.txt
lambda -s -t ((((1,1)1,1)1,(((((1,1)1,1)1,1)1,(1,1)1)1,1)1)1,(1,1)1)

Lambda Search with Multiple Lambdas

Search for separate lambda values for the chimp/human clade and the rest of the tree separately, using the mammal phylogeny and gene family set in the Examples directory:

../bin/cafe5 -t mammals_tree.txt -i mammal_gene_families.txt -p -y chimphuman_separate_lambda.txt -o doublelambda

In earlier versions, the following script would have returned the same values:

((((cat:68,horse:68):4,cow:73):20,(((((chimp:4,human:4):6,orang:11):2,gibbon:13):7,(macaque:4,baboon:4):16):16,marmoset:36):57):38,(rat:36,mouse:36):96) 
load -i integral_test_families.txt -t 10
lambda -s -t ((((1,1)1,1)1,(((((2,2)2,2)2,2)2,(1,1)1)1,1)1)1,(1,1)1)

Error Models

Search for a single lambda value using the Newick tree of mammals in the Samples folder, and the family files from the CAFE tutorial, applying an error model:

cafe5 -t data/mammals_integral_tree.txt -i data/filtered_cafe_input.txt -p -l 0.01 -e data/cafe_errormodel_0.0548828125.txt

In earlier versions, the following script would have returned the same values:

tree ((((cat:68,horse:68):4,cow:73):20,(((((chimp:4,human:4):6,orang:11):2,gibbon:13):7,(macaque:4,baboon:4):16):16,marmoset:36):57):38,(rat:36,mouse:36):96) load -i integral_test_families.txt -t 10
errormodel -all -model cafe_errormodel_0.0548828125.txt
lambda -l 0.01 -t ((((1,1)1,1)1,(((((1,1)1,1)1,1)1,(1,1)1)1,1)1)1,(1,1)1) -score

Error Model Estimation

Estimate an error model:

cafe5 -t mammals_integral_tree.txt -i filtered_cafe_input.txt -p -e errormodel.txt

In earlier versions, the following script would have returned the same values:

load -i filtered_cafe_input.txt -t 4 -l reports/log_run6.txt
tree ((((cat:68.710507,horse:68.710507):4.566782,cow:73.277289):20.722711,(((((chimp:4.444172,human:4.444172):6.682678,orang:11.126850):2.285855,gibbon:13.412706):7.211527,(macaque:4.567240,baboon:4.567240):16.056992):16.060702,marmoset:36.684935):57.315065):38.738021,(rat:36.302445,mouse:36.302445):96.435575)
lambda -s -t ((((1,1)1,1)1,(((((1,1)1,1)1,1)1,(1,1)1)1,1)1)1,(1,1)1)
report reports/report_run6

Troubleshooting

Known Limitations

Because the random birth and death process assumes that each family has at least one gene at the root of the tree, CAFE5 will not provide accurate results if included gene families were not present in the most recent common ancestor (MRCA) of all taxa in the tree. For example, even if all taxa have a gene family size of 0, CAFE will assign the MRCA a gene family of size 1, and include the family in estimation of the birth and death rate. This difficulty does not affect analyses containing families that go extinct subsequent to the root node.

If a change in gene family size is very large on a single branch, CAFE5 may fail to provide accurate λ estimation and/or die during computation. To see if this is a problem, look at the likelihood scores computed during the λ search (reported in the log file if the job finishes). If ALL scores are “-inf” then there is a problem with large size changes and CAFE5 has calculated a probability of 0. Removing the family with the largest difference in size among species and rerunning CAFE5 should allow λ to be estimated on the remaining data. If the problem persists, remove the family with the next largest difference and proceed in a like manner until CAFE5 no longer finds families with zero probability. However, if rapidly evolving families are removed, care should be taken in interpretation of the estimated average rate of evolution for the remaining data.

In very large phylogenetic trees there can be many independent lambda parameters (2n - 2 in a rooted tree, where n is the number of taxa). CAFE5 does not always converge to a single global maximum with large numbers of λ parameters, and therefore can give misleading results. To check for this you should always run the λ search multiple times to ensure that the same estimated values are found. Also, the likelihood of models with more parameters should always be lower than models with fewer parameters, which may not be true if CAFE5has failed to find a global maximum. If CAFE does not converge over multiple runs, then one should reduce the number of parameters estimated and try again.

Logging

More verbose logging can be provided with an EasyLogging log config file. The file may look like this:

* GLOBAL:
   FORMAT               =  "%datetime %msg"
   FILENAME             =  "cafe.log"
   ENABLED              =  true
   TO_FILE              =  true
   TO_STANDARD_OUTPUT   =  true
   SUBSECOND_PRECISION  =  6
   PERFORMANCE_TRACKING =  true
   MAX_LOG_FILE_SIZE    =  2097152 ## 2MB - Comment starts with two hashes (##)
   LOG_FLUSH_THRESHOLD  =  100 ## Flush after every 100 logs
* DEBUG:
   FORMAT               = "%datetime{%d/%M} %func %msg"

For more information, see https://github.com/amrayn/easyloggingpp#using-configuration-file

Pass the config file to CAFE with the –log_config flag. For example,

cafe5 -t examples/mammals_tree.txt -i filtered_cafe_input.txt --log_config log.config      

Technical

How does the optimizer work?

The Nelder-Mead optimization algorithm is used. It runs until it can find a difference of less than 1e-6 in either the calculated score or the calculated value, or for 10,000 iterations. The parameters that are used for the optimizer are as follows:

In some cases, the optimizer suggests values that cannot be calculated (due to saturation, negative values, or other reasons) In this case, an infinite score is returned and the optimizer continues.

When optimizing for an alpha value with a set number of clusters, if the largest multiplier in the longest branch is saturated, the scorer will return an infinite value. This will be noted at the end of the run with text like:

90 values were attempted (10% rejected)

showing that 10

The following families had failure rates >20% of the time:
Family6 had 22 failures
Family9 had 19 failures

Certain options are available at compile-time for the optimizer. If OPTIMIZER_STRATEGY_INITIAL_VARIANTS is defined, the optimizer will take several shorter attempts at various values before settling on one value to continue on with. This may cause the optimizer to take more iterations to finish but may have greater accuracy. If OPTIMIZER_STRATEGY_PERTURB_WHEN_CLOSE is defined, the optimizer will begin searching a wider range of values when it is getting close to a solution. This attempts to get the optimizer out of a local optima it may have found.

How does the simulator choose what lambda to use?

Although the user specifies the lambda, in order to give more family variety a multiplier is selected every 50 simulated families. So if 10,000 families are being simulated, 200 different lambdas will be used.

When simulating without the gamma model, the multiplier is a random value based on a normal distribution with a mean of 1 and a standard deviation of 0.3.

When simulating WITH the gamma model, the multiplier is drawn directly from a gamma distribution based on the selected alpha and a mean of 1. If clustering is requested via the -k parameter, the selected cluster multiplier is modified by a normal distribution with the mean at the value of the multiplier, and a standard deviation intended to reflect the number of clusters requested.

Initial Guesses

One of the most important concerns when searching a parameter space is what initial values to choose. Each of the three values that CAFE can search for has a particular initial guess strategy. For lambda values, the formula is 1 / (longest tree branch times a random number between 0 and 1). For epsilon values, the initial guesses are taken directly from the provided error model. For gamma values, a random value taken from an exponential distribution is used.

In some situations the values may fail. In this case, the scorer will return an infinite value and the optimizer will retry initialization, up to a number of attempts determined at compile time. If, after this number of attempts, the optimizer continues to fail, it will abort. The most likely cause of failure is too wide a variety of species sizes inside certain families, and a message will be shown giving the most likely families to remove from the analysis for success.

Acknowledgements

Many people have contributed to the CAFE project, either by code or ideas. Thanks to:

CAFE uses the EasyLogging logging framework. https://github.com/amrayn/easyloggingpp

CAFE uses the DocTest testing framework. https://github.com/onqtam/doctest