Tutorial 

Codon usage analysis

Included with this distribution of codonW should be a test dataset of 
sequences (input.dat). We will use this set of sequence as a typical example 
of a codon usage analysis. This test dataset is derived from the open 
reading frames (ORFs) of Saccharomyces cerevisiae chromosome III as 
annotated in the EMBL feature table for the sequence entry SCCHRIII 
(accession number X59720). In the current EMBL (Release 51 June 1997) the 
number of annotated ORFs was 172. The file input.dat contains 111 of these 
ORFs. The rational and why some ORFs where removed is explained below.

The commandline syntax of codonW will be used in this tutorial, all options 
selected from the commandline are also selectable using the menu system. For 
more information please read the command line help (codonw -help) or just 
type "codonw" and use the menu specific online. 

Build your dataset of genes carefully.
Always remember that as in any analysis, but particularly with codon usage, 
GIGO (garbage in, garbage out). Examine as many sources of information about 
the data as possible, particularly the original publication and sequence 
annotations. It is important that the sequences are a representative sample. 
Five ORFs where removed from the dataset because they where annotated (and 
had sequence identity) with genes within the previously identified 
transposable elements Ty2 and Ty5.  These ORFs where annotated at positions 
1537-2127, 2118-2558, 2816-3742, 84714-86030, 84714-90384. The codon usage 
of transposable element genes differs from that of chromosomal genes. 

Further checks of sequence annotation was carried out, those sequences which 
had not been assigned gene names or SwissProt accession numbers where 
removed. The SwissProt annotation was also checked, genes described as 
hypothetical but which did not have any sequence identity with other 
proteins where removed. 
Check basic sequence integrity  
Sequences should be checked to confirm that they match some basic gene 
characteristics. Each sequence might reasonably be expected to have an 
initiation codon and a translation termination codon, and no internal stop 
codons. Those sequences that do not match these characteristics, or 
sequences that have partial codons or untranslatable codons are flagged by 
codonw with warning messages.

To make a first pass of the input data to check for simple sequence 
problems: 
codonw input.dat   -nomenu 		

By default codonw will report the codon usage of each gene to the file 
input.blk. As there are no problems with this dataset there should be no 
warning messages. However analysis of a previous version of this dataset 
based on EMBL Release 50 where SCCHRIII had 230 annotated ORFs, generated 
these typical warning messages.  

Warning: Sequence 178 "SCCHRIII.PE178______" does not begin with a 
recognised start codon
Warning: Sequence 178 "SCCHRIII.PE178______" is not terminated by a stop 
codon
Warning: Sequence 202 "SCCHRIII.PE202______" does not begin with a 
recognised start codon
Warning: Sequence 202 "SCCHRIII.PE202______" has 1 internal stop codon(s)
Warning: Sequence 202 "SCCHRIII.PE202______" is not terminated by a stop 
codon

Each sequence is labelled by its numerical occurrence in the input file 
(i.e. these are the 178th and 202nd sequences in the input file) and its 
sequence header line.

Sequences that generate warning messages should be examined closely to 
ascertain why. Some sequences may be annotated as partial sequences and 
therefore the absence of a start or stop codon or the presence of a 3' 
partial codon is to be expected. Note the presence of a 5' partial codon 
would cause a frame shift, it is ESSENTIAL that 5' partial codons are 
removed. Unless the frame shift that they produce, results in a (incorrect) 
reading frame that contains internal stop codons, codonw cannot detect this 
problem. The codon usage of a frame shifted gene sequence could adversely 
affect the correspondence analysis  (COA) (though such genes are often 
recognisable as being outliers on the COA plots).

If a sequence warning is due to incorrect annotation this should be 
corrected manually. Sequences that produce warnings that cannot be explained 
or justified (e.g. a gene with internal stop codon) should be excluded. 
These warning are informational only and do not exclude sequences from the 
analysis.

Codon usage indices 
Once the initial quality checks have been made for the data we can then 
proceed with the codon usage analysis (strictly speaking we can generate COA 
and codon usage indices tasks at the same time). Some of the indices of 
codon usage bias that CodonW calculates (i.e. Fop, CAI and CBI) use 
information about a preferred set of codons for highly expressed genes. This 
information is species specific and does not apply to all species (most 
eukaryotes and many prokaryotes appear to display no codon preference in 
highly expressed genes). Therefore care must be taken that the appropriate 
set of optimal codons are used. For most species the optimal codons are not 
know and therefore the indices should not be calculated at this stage. 
However this information is known for Saccharomyces cerevisiae, so we can 
immediately calculate these indices of codon usage. Later we will see how 
codonW identifies optimal codons and can generate this information for your 
species.  
 
The default optimal codons and codon adaptation values are those of E. coli. 
To select an alternative choice we use the c_type (for CAI values ) and 
f_type (for FOP/CBI) commandline arguments. These switches requires an 
integer values, this value is the same as the option number if we where 
using the menu system to change the codon information. 

Example       "-c_type 2" is equivalent to 
Choose "Main Menu"
Choose "Changes Defaults Menu"
Choose "Change the CAI values"
Choose "(2) Saccharomyces cerevisiae"

Example       "-f_type 4" is equivalent to 
Choose "Main Menu"
Choose "Changes Defaults Menu"
Choose "Change the Fop/CBI values"
Choose "(4) Saccharomyces cerevisiae"


Therefore to select all the codon usage indices calculated by codonw and to 
use the optimal codons of Saccharomyces cerevisiae type:

codonw input.dat  -all_indices  -c_type 2 -f_type 4 -nomenu

See below for the output of this command  
The commandline flag -nomenu by passes the menu system, the -all_indices 
indicates to codonw that you wish to calculate all the codon and amino acid 
usage indices. These indices areT3s, C3s, A3s, G3s, CAI, CBI, Fop, Nc, GC3s, 
GC, L_sym, L_aa, Gravy and Aromaticity. For a fuller explanation of what 
these indices are see Readme.indices. These indices can also be used to 
check whether there are any identical or almost identical sequences in the 
input file. If we sort the result file "input.out" we it is much easier to 
identify the sequences which are similar. 

sort -k 2n  input.out                 (unix for "sort using the second 
numerical field") 

The sorted output reveals the presence of two pairs of identical sequences 
(Mating type proteins)
ALPHA2____________63   0.3636  0.2273  0.4939  0.2177  0.109
MATALPHA2_________63   0.3636  0.2273  0.4939  0.2177  0.109
and 
ALPHA1____________52   0.4361  0.2180  0.4228  0.2589  0.112   
MATALPHA1_________52   0.4361  0.2180  0.4228  0.2589  0.112   

Sequences which appear to be multiple copies of the same gene are normally 
removed from our codon usage datasets, even if the sequences are not 
identical but where the differences can be attributed to sequencing error or 
allelic polymorphism.  However different sequences that appear to be members 
of the same multigene family are retained, even if identical.  As we know 
these ORFs are from different regions of the chromosome and are not the same 
sequence they where not removed from the sample dataset.

Tabulation of Overall Codon Usage 
A common representation of codon usage is a tabular format of the total 
codon usage of a dataset. CodonW can automatically generate this table for 
you.

codonw input.dat  -cutot 

The tabulated total codon usage is stored in the output file input.blk.

See Tabulated codon usage below. 



Preliminary codon usage analysis
The effective number of codons index, is a very useful preliminary tool for 
codon usage analysis [Wright, 1990 #24]. It is a simple measure of codon 
bias, analogous to the effective number of allele measure used in population 
genetics. It gives the number of equally used codon that would generate the 
same codon usage bias as observed, lower values indicate stronger bias. A 
useful feature of ENc is that the affect of GC biases have on the index can 
be estimated. This allows the comparison of GC3s and ENc against the 
theoretical values if codon bias was simply caused due to GC mutational 
bias. A plot of ENc vs. GC3s can be seen at 
http://www.molbiol.ox.ac.uk/cu/EncVsGC3s.gif. Although the majority of genes 
in this plot have a degree of codon bias that can be explained in terms of 
GC mutation, the cluster of genes (six genes with ENc <40) which have much 
stronger codon bias than be simply explained in terms of mutational biases. 
These genes are good candidates as genes whose codon usage has been 
determined by natural selection, probably selection for translational 
efficiency. 

Correspondence Analysis (COA)
We are now ready to generate a correspondence analysis of the codon usage of 
SCCHRIII genes. We have a choice about how much information is generated. In 
this example we will use the default values.

codonw input.dat -coa_cu  -nomenu -silent		(-silent stops all 
prompting)

This generates a COA of codon usage. The summary file is "summary.coa" and 
contains most of the data generated by the COA. One of the first sections is 
the "Explanation of the variation by axis"  also stored in eigen.coa. 

The total inertia of the data was 0.263176
Num. Eigenval.   R.Iner.  R.Sum    |Num. Eigenval.   R.Iner.  R.Sum  |
01   +4.5755E-02 +0.1739 +0.1739   |02   +3.2372E-02 +0.1230 +0.2969 |
03   +1.8405E-02 +0.0699 +0.3668   |04   +1.2499E-02 +0.0475 +0.4143 |

The relative inertia explained by the first axis  is 17.4%, the 2nd axis 
explains 12.3%, the 3rd 7.0%, etc.  (17.45% is not remarkably high for 
relative inertia explained by the first axis, but as there are ORFs included 
which are described as hypothetical there may be random noise present in the 
data if they are not real). 

The next two sections report position of each gene and codon on the trends. 

label                    Axis1      Axis2      Axis3      Axis4
1_YCG9_Probable_____    0.00904    0.13153    0.34028   -0.05372
2_YCG8________573_re    0.07429   -0.24652   -0.05502   -0.39837
3_ALPHA2________633_    0.30675    0.04259   -0.22864   -0.03878
4_ALPHA1________528_    0.16444    0.00399   -0.02000    0.00937
5_CHA1_________1083_   -0.00322    0.10387    0.07137    0.11896

this information is best viewed graphically, an example of the location of 
the genes on the two principal axes can be seen here 
http://www.molbiol.ox.ac.uk/cu/axes.gif.

Automatic Identification of Putative Optimal Codons 
Codonw automatically tries to identify the optimal codons in your data, or 
more precisely identify the codons which contribute to the major trend (if 
the main trend is selection for translational optimality these should be the 
optimal codons). It does this by comparing the codon usage of groups of 
genes taken from each extreme of the principle trend (axis 1). It identifies 
the set of genes with the highest bias (using the effective number of codons 
index) and tests for significant differences in the codon usage of between 
the higher bias set with a two way Chi-squared contingency test. The 
putative optimal codons are listed in summary.coa and hilo.coa.  It is the 
responsibility of the user to confirm that the major codon usage trend is 
selection for translational optimality, and not due to some other mutational 
pressure (see GC variation).  The number of genes included in the two groups 
can be selected using the command line switch ( -coa_num ) as an absolute 
number of genes, of a percentage of the total genes in the dataset (by 
default 5%).  

The analysis of this dataset identified 19 codons that appeared to be 
optimal. 18 of these agree with optimal codon identified previously using a 
larger dataset set of 575 genes [Sharp, 1991 #46]. The codon identified in 
this analysis as being optimal but not in the previous analysis, was GCC; 
this codon has been previously suggested as being an optimal codon in S. 
cerevisiae [Bennetzen, 1982 #92]. The U ending codons, AUU, GUU and UGU, 
which have been previously identified as optimal [Sharp, 1991 #46], where 
not identified here at p<0.01; although UGU was identified as potentially 
optimal with a p<0.02. The main reason that the U ending codons where not 
identified from this dataset was their much higher usage in the lower biased 
dataset. 


Caveats
1) The codons identified by codonw, as being optimal will be dependent on 
the strength of the trend and the size of the datasets. 
2) The composition of the genes from chromosome III is quite different from 
the 575-gene dataset used by Sharp and Cowe. Only one of the 30 genes they 
considered to be highly expressed, and none of the genes they considered 
lowly expressed are present in this dataset. The reader is reminded that 
there are approximately 15,000 yeast genes, so just a little over 1% are 
located on chromosome III.

Codonw generated personal choice of codons 
On the assumption that the principle trend identified by codonw is selection 
for translational optimality, and that the genes assigned to the highly bias 
codon usage group are highly expressed, codonw outputs files with the 
"optimal codons" and "CAI adaptation fitness values". These files are 
fop.coa, cbi.coa and cai.coa, their filenames are related to the index they 
have been formatted for.  These files can be used to calculate the indices 
in species where the preferred codon usage has not been hardwired into 
codonW.

codonw input.dat  -fop_file fop.coa 
codonw input.day -cai_file cai.coa -cbi_file cbi.coa

Caveats 
1) The original CAI paper calculated fitness values from experimentally 
determined highly expressed genes.  The fitness values that are internal 
to codonW where derived from these criteria. CAI indices calculated using 
fitness values derived from genes identified solely by COA, as being 
highly expressed should not be regarded as true CAI values.
2) The optimal codons stored in the files cbi.coa and fop.coa where 
identified by codonw using a statistical test of significance, this test 
is dependent on sample size. 
3) The size of the sample taken from the extremes of the axis will affect 
the identified optimal codons. 
4) The principle trend in the variation of codon usage may not be 
translation optimality. 

When we calculate the indexes CAI, CBI and Fop using the "codonw" generated 
optimal codons and fitness values based on this small dataset, as we would 
expect differ from when these indices are calculated using the codonw 
internal codon usage information for S. cerevisiae. The internal values are 
more accurate because the datasets used to generate them where larger, and 
contained experimentally verified gene sequences.

Although the two sets of indices differ, they remain highly correlated, all 
three indices have correlation coefficients greater than 0.96. Therefore if 
comparisons between the index values are internally consistent (i.e. they 
where both calculated using the same optimal codon information) relative 
comparisons of codon usage and bias can be made. Based on a dataset of 111 
genes we have been able to identify optimal codons, which give us some 
insight into the codon usage of S. cerevisiae. 


Axis2 is highly correlated with GC3s content
Alternative datasets could have been chosen that would present a much 
simpler analyses of codon usage  (i.e. where the optimal codons identified 
better matched those previously published). This dataset was specifically 
chosen as the codon usage variation for genes from this chromosome is know 
to have a second trend,  GC3s varies with chromosomal location in a 
systematic fashion [Sharp, 1993 #39]. When we examine correlation 
coefficients between the first 4 axes the correlation coefficient between 
axis2 and GC3s is highly significant (r=0.89). Interestingly the bias is most 
strong among the U ending codons it is possible that the presence of this 
trend contributed to why the three U ending codons where not identified here 
as optimal codons. This trend is quite strong accounting for 12.3% of the 
relative inertia of the data, the principle trend (apparently selection for 
translation optimality) accounted for 17.4%. We therefore see how it is 
possible that the strongest influence on the choice of codon usage might not 
be translation optimality but mutation biases.




Typical output from codonw -all_indices -nomenu 
======================= Output ======================================

Genetic code is currently set to Universal Genetic code TGA=* TAA=* TAG=*

 Welcome to CodonW  1.3  for Help type h

Using Saccharomyces cerevisiae (Sharp and Cowe (1991) Yeast 7:657-678) 
w values to calculate CAI 
Using Saccharomyces cerevisiae  (Sharp and Cowe (1991) Yeast 7:657-678) 
optimal codons to calculate CBI
Using Saccharomyces cerevisiae  (Sharp and Cowe (1991) Yeast 7:657-678)
optimal codons to calculate Fop
..................................................................

                Number of sequences: 111

Files used:
 Input  file was         input.dat 
 Output file was         input.out  (codon usage indices, e.g. gc3s)
 Output file was         input.blk  (bulk output e.g. raw codon usage)

 CodonW has finished
         ======================================================
Tabulation of total codon usage

Phe UUU 1483 1.14 Ser UCU 1094 1.47 Tyr UAU 1000 1.12 Cys UGU  434 1.18 
    UUC 1117 0.86     UCC  773 1.04     UAC  789 0.88     UGC  303 0.82 
Leu UUA 1349 1.55     UCA  882 1.19 TER UAA   47 1.27 TER UGA   36 0.97 
    UUG 1549 1.78     UCG  487 0.66     UAG   28 0.76 Trp UGG  665 1.00 

    CUU  698 0.80 Pro CCU  747 1.27 His CAU  677 1.15 Arg CGU  328 0.86 
    CUC  364 0.42     CCC  415 0.71     CAC  499 0.85     CGC  171 0.45 
    CUA  671 0.77     CCA  911 1.55 Gln CAA 1388 1.35     CGA  151 0.39 
    CUG  604 0.69     CCG  281 0.48     CAG  668 0.65     CGG  103 0.27 

Ile AUU 1612 1.35 Thr ACU 1052 1.38 Asn AAU 1778 1.17 Ser AGU  717 0.97 
    AUC 1018 0.85     ACC  660 0.87     AAC 1262 0.83     AGC  500 0.67 
    AUA  943 0.79     ACA  883 1.16 Lys AAA 2118 1.13 Arg AGA 1038 2.71 
Met AUG 1156 1.00     ACG  444 0.58     AAG 1645 0.87     AGG  504 1.32 

Val GUU 1184 1.49 Ala GCU 1055 1.40 Asp GAU 1905 1.25 Gly GGU 1284 1.87 
    GUC  674 0.85     GCC  765 1.01     GAC 1145 0.75     GGC  552 0.80 
    GUA  622 0.78     GCA  836 1.11 Glu GAA 2371 1.41     GGA  557 0.81 
    GUG  690 0.87     GCG  368 0.49     GAG  995 0.59     GGG  355 0.52 

53400 codons (used Universal Genetic code)
 
======================================================