Randomly split the set of annotated sequences into a training set and a test set.

randomSplit.pl genes.gb 100

This generates a file genes.gb.test with 100 randomly chosen loci and a disjoint file genes.gb.train with the rest of the loci from genes.gb:

grep -c LOCUS genes.gb*
# genes.gb:492
# genes.gb.test:100
# genes.gb.train:392

For the test accuracy to be statistically meaningful, the test set should also be large enough (100-200 genes). Split the set of gene structures truly at random. Do not just take the first and the last part of the file, because the test set is then unlikely to be representative. The script randomSplit.pl is in the scripts directory.
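The idea behind such a split can be sketched in a few lines of Python. This is only an illustration, not the implementation of randomSplit.pl: the function names are hypothetical, and the sketch assumes that every GenBank record ends with a line containing only "//".

```python
import random


def split_genbank_records(text):
    """Cut GenBank flat-file text into records (assumes each ends with '//')."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "//":
            records.append("".join(current))
            current = []
    return records


def random_split(records, n_test, seed=None):
    """Return (train, test): n_test records drawn at random, the rest for training."""
    if n_test > len(records):
        raise ValueError("test set larger than the number of records")
    rng = random.Random(seed)
    test_idx = set(rng.sample(range(len(records)), n_test))
    test = [r for i, r in enumerate(records) if i in test_idx]
    train = [r for i, r in enumerate(records) if i not in test_idx]
    return train, test
```

Sampling indices rather than slicing the file avoids the bias of taking the first or last part of the file, and the two sets are disjoint by construction.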
We call parameters such as the window size of the splice site models and the order of the Markov model meta parameters, in contrast to parameters such as the distribution of
splice site patterns and the k-mer probabilities of coding and noncoding regions.
There are a few dozen meta parameters but many thousands of parameters. The meta parameters
determine how the parameters are calculated.
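To make the distinction concrete, here is a minimal Python sketch. It is not AUGUSTUS code; the function name and the pseudocount are assumptions for illustration. The Markov order plays the role of a meta parameter: it fixes how many k-mer probabilities have to be estimated from the training sequences.

```python
from collections import Counter
from itertools import product


def markov_emission_probs(sequences, order, pseudocount=1.0):
    """Estimate P(base | preceding `order` bases) from training sequences.

    `order` is the meta parameter; the returned dict holds the parameters.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.lower()
        for i in range(order, len(seq)):
            kmer = seq[i - order:i + 1]
            if set(kmer) <= set("acgt"):  # skip k-mers with ambiguous bases
                counts[kmer] += 1
    probs = {}
    for ctx in map("".join, product("acgt", repeat=order)):
        total = sum(counts[ctx + b] + pseudocount for b in "acgt")
        for b in "acgt":
            probs[ctx + b] = (counts[ctx + b] + pseudocount) / total
    return probs
```

With order 4 this single meta parameter already leads to 4^5 = 1024 estimated probabilities, and changing it changes how all of them are calculated.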
Create the files for training "bug" from a template.
new_species.pl --species=bug

new_species.pl uses the environment variable AUGUSTUS_CONFIG_PATH to determine the directory in which AUGUSTUS stores the species parameters. You should see a report like this:

creating directory /home/mario/augustus/trunk/config/species/bug/ ...
creating /home/mario/augustus/trunk/config/species/bug/bug_parameters.cfg ...
creating /home/mario/augustus/trunk/config/species/bug/bug_weightmatrix.txt ...
creating /home/mario/augustus/trunk/config/species/bug/bug_metapars.cfg ...
...

Besides meta parameters, the file bug_parameters.cfg also contains settings for augustus and etraining, such as defaults for the output format.
etraining --species=bug genes.gb.train

This creates or updates the parameter files for exons, introns and the intergenic region in the directory $AUGUSTUS_CONFIG_PATH/species/bug.

ls -ort $AUGUSTUS_CONFIG_PATH/species/bug/

now yields

-rw------- 1 mario    810 Jun 23 16:48 bug_weightmatrix.txt
-rw------- 1 mario   2057 Jun 23 16:48 bug_metapars.cfg
-rw------- 1 mario   1356 Jun 23 16:48 bug_metapars.utr.cfg
-rw-rw-r-- 1 mario   1125 Jun 23 16:48 bug_metapars.cgp.cfg
-rw-rw-r-- 1 mario   7162 Jun 23 16:49 bug_parameters.cfg~
-rw-rw-r-- 1 mario   7163 Jun 23 16:50 bug_parameters.cfg
-rw-rw-r-- 1 mario 350278 Jun 23 16:51 bug_intron_probs.pbl
-rw-rw-r-- 1 mario 256132 Jun 23 16:51 bug_exon_probs.pbl
-rw-rw-r-- 1 mario  32545 Jun 23 16:51 bug_igenic_probs.pbl

where bug_{intron,exon,igenic}_probs.pbl are our newly created parameter files.
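If you script the training, it may help to verify that the three .pbl files actually exist before continuing. The following Python sketch shows one way to do that. The helper name is hypothetical, and the file name pattern is taken from the listing above; it is an assumption that other AUGUSTUS versions use the same names.

```python
import os
from pathlib import Path

# Suffixes as they appear in the directory listing for species "bug".
EXPECTED_SUFFIXES = ("_intron_probs.pbl", "_exon_probs.pbl", "_igenic_probs.pbl")


def missing_parameter_files(species, config_path=None):
    """Return the names of expected .pbl files that are not present."""
    config_path = config_path or os.environ.get("AUGUSTUS_CONFIG_PATH")
    if not config_path:
        raise RuntimeError("AUGUSTUS_CONFIG_PATH is not set")
    species_dir = Path(config_path) / "species" / species
    return [
        species + suffix
        for suffix in EXPECTED_SUFFIXES
        if not (species_dir / (species + suffix)).is_file()
    ]
```

An empty list from missing_parameter_files("bug") would indicate that all three files are present.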
augustus --species=bug genes.gb.test | tee firsttest.out # takes ~1m

This predicts the genes in all 100 sequences and prints at the end a report about the prediction accuracy, comparing the structures in the input file genes.gb.test with the predicted ones. Of course, only the sequences are used for the predictions, not the input gene structures.

grep -A 22 Evaluation firsttest.out

*******      Evaluation of gene prediction     *******

---------------------------------------------\
                 | sensitivity | specificity |
---------------------------------------------|
nucleotide level |       0.873 |       0.626 |
---------------------------------------------/

----------------------------------------------------------------------------------------------------------\
           |  #pred |  #anno |      |    FP = false pos. |    FN = false neg. |             |             |
           | total/ | total/ |   TP |--------------------|--------------------| sensitivity | specificity |
           | unique | unique |      | part | ovlp | wrng | part | ovlp | wrng |             |             |
----------------------------------------------------------------------------------------------------------|
           |        |        |      |                253 |                101 |             |             |
exon level |    484 |    332 |  231 | ------------------ | ------------------ |       0.696 |       0.477 |
           |    484 |    332 |      |   35 |    0 |  218 |   36 |    0 |   65 |             |             |
----------------------------------------------------------------------------------------------------------/

----------------------------------------------------------------------------\
transcript | #pred | #anno |   TP |   FP |   FN | sensitivity | specificity |
----------------------------------------------------------------------------|
gene level |   156 |   100 |   47 |  109 |   53 |        0.47 |       0.301 |
----------------------------------------------------------------------------/

These numbers mean, for example, that
of the 100 genes 47 were predicted exactly
69.6% of the exons were predicted exactly
47.7% of the predicted exons were exactly as in the test set.
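The exon and gene level figures can be recomputed from the counts in the report. The short Python sketch below does this; the function names are chosen for illustration. Note that the quantity the report calls specificity is TP divided by the number of predictions, which elsewhere is often called precision.

```python
def sensitivity(tp, n_annotated):
    """Fraction of annotated features that were predicted exactly."""
    return tp / n_annotated


def specificity(tp, n_predicted):
    """Fraction of predicted features that are exactly as annotated."""
    return tp / n_predicted


# Exon level: 231 true positives, 332 annotated exons, 484 predicted exons.
exon_sn = sensitivity(231, 332)
exon_sp = specificity(231, 484)

# Gene level: 47 true positives, 100 annotated genes, 156 predicted genes.
gene_sn = sensitivity(47, 100)
gene_sp = specificity(47, 156)

print(f"exon level: {exon_sn:.3f} {exon_sp:.3f}")  # 0.696 0.477
print(f"gene level: {gene_sn:.3f} {gene_sp:.3f}")  # 0.470 0.301
```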
The script optimize_augustus.pl optimizes the prediction accuracy by adjusting the meta parameters in the *_parameters.cfg file. It alternates between running augustus and etraining. This usually increases prediction accuracy by a few percentage points, but runs for hours or days. It may be skipped, in which case etraining is run only once (step 3 above), which is very quick. augustus and etraining must be in the $PATH.
You need to tell optimize_augustus.pl which meta parameters it should optimize.
Do this by adjusting the file config/species/generic/generic_metapars.cfg.
(You may also make a copy of it and then use the command line parameter
--metapars=nameofmycopy to the script optimize_augustus.pl.)
Run
optimize_augustus.pl --species=bug genes.gb.train # takes ~1d
After optimize_augustus.pl has finished (or after you have interrupted it), you should (re)train AUGUSTUS with the meta parameters it has set.
etraining --species=bug genes.gb.train
If you have a test set, you can now check the prediction accuracy on this test set by running
augustus --species=bug genes.gb.test
The end of the output will then contain a summary of the accuracy of
the prediction. If the gene level sensitivity is below 20%, it is likely
that the training set is not large enough, that its quality is poor,
or that the species is somehow 'special'.
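When several training runs are compared, this check can be automated. The Python sketch below reads the gene level row from a saved augustus report and applies the 20% rule of thumb. The function names are hypothetical, and the parsing assumes the pipe-separated row layout shown in the evaluation report of this tutorial.

```python
def gene_level_stats(report):
    """Extract counts and accuracy from the 'gene level' row, or return None."""
    for line in report.splitlines():
        if line.strip().startswith("gene level"):
            fields = [f.strip() for f in line.split("|")]
            pred, anno, tp, fp, fn = map(int, fields[1:6])
            sens, spec = map(float, fields[6:8])
            return {"pred": pred, "anno": anno, "tp": tp, "fp": fp,
                    "fn": fn, "sensitivity": sens, "specificity": spec}
    return None


def sensitivity_too_low(stats, threshold=0.20):
    """True if the gene level sensitivity is below the rule-of-thumb threshold."""
    return stats["sensitivity"] < threshold


row = "gene level |   156 |   100 |   47 |  109 |   53 |        0.47 |       0.301 |"
stats = gene_level_stats(row)
print(stats["sensitivity"], sensitivity_too_low(stats))
```

In practice the report text would be read from a file such as firsttest.out instead of the inline string.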
If you succeeded in creating a good AUGUSTUS version for your
species I would be very interested in your results. If possible please
share your results by giving me the packed config/yourspecies folder.
4. SPECIAL CASE: ORGANISM WITH DIFFERENT GENETIC CODE
AUGUSTUS can be told to use a different translation table, in particular one with a different set of stop codons. This is useful for a small number of species such as Tetrahymena thermophila, in which some codons translate to a different amino acid than usual. If you train AUGUSTUS for such a species, set the variable translation_table in the parameter file of your species. Further, adjust the stop codon probabilities in the same config file. For example:
translation_table 6
/Constant/amberprob 0 # Prob(stop codon = tag), if 0 tag is assumed to code for amino acid
/Constant/ochreprob 0 # Prob(stop codon = taa), if 0 taa is assumed to code for amino acid
/Constant/opalprob 1 # Prob(stop codon = tga), if 0 tga is assumed to code for amino acid
Choose the translation table number according to the table below. translation_table=1 is the default and corresponds to the standard genetic code with stop codons taa, tga and tag. If your species uses the standard genetic code, you don't have to do anything. If your species' code is not covered by this table, send us a note with the string of 64 one-letter amino acid codes in the codon order below.
translation | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | g | g | g | g | g | g | g | g | g | g | g | g | g | g | g | g | t | t | t | t | t | t | t | t | t | t | t | t | t | t | t | t |
table | a | a | a | a | c | c | c | c | g | g | g | g | t | t | t | t | a | a | a | a | c | c | c | c | g | g | g | g | t | t | t | t | a | a | a | a | c | c | c | c | g | g | g | g | t | t | t | t | a | a | a | a | c | c | c | c | g | g | g | g | t | t | t | t |
number | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t |
1 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | * | C | W | C | L | F | L | F |
2 | K | N | K | N | T | T | T | T | * | S | * | S | M | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
3 | K | N | K | N | T | T | T | T | R | S | R | S | M | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | T | T | T | T | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
4 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
5 | K | N | K | N | T | T | T | T | S | S | S | S | M | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
6 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | Q | Y | Q | Y | S | S | S | S | * | C | W | C | L | F | L | F |
9 | N | N | K | N | T | T | T | T | S | S | S | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
10 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | C | C | W | C | L | F | L | F |
11 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | * | C | W | C | L | F | L | F |
12 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | S | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | * | C | W | C | L | F | L | F |
13 | K | N | K | N | T | T | T | T | G | S | G | S | M | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
14 | N | N | K | N | T | T | T | T | S | S | S | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | Y | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
15 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | Q | Y | S | S | S | S | * | C | W | C | L | F | L | F |
16 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | L | Y | S | S | S | S | * | C | W | C | L | F | L | F |
21 | N | N | K | N | T | T | T | T | S | S | S | S | M | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
22 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | L | Y | * | S | S | S | * | C | W | C | L | F | L | F |
23 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | * | C | W | C | * | F | L | F |
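The codon order of the table (first base changing slowest, third base fastest, each in the order a, c, g, t) can be reproduced in Python, which makes it easy to write a code as a 64-letter string and to see which stop codons it has. This is a minimal sketch with hypothetical helper names. The two strings are transcribed from rows 1 and 6 of the table, and the mapping of amber, ochre and opal to tag, taa and tga follows the comments in the config example above.

```python
from itertools import product

# Codons in table order: aaa, aac, aag, aat, aca, ..., ttt.
CODONS = ["".join(c) for c in product("acgt", repeat=3)]

# Rows 1 and 6 of the table, written as 64-letter strings.
TABLE_1 = "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF"
TABLE_6 = "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVVQYQYSSSS*CWCLFLF"


def stop_codons(code):
    """Return the codons that the 64-letter code marks with '*'."""
    if len(code) != 64:
        raise ValueError("a genetic code string needs exactly 64 letters")
    return sorted(c for c, aa in zip(CODONS, code) if aa == "*")


def differences(code, reference=TABLE_1):
    """Map each codon that differs to (reference letter, letter in code)."""
    return {c: (r, a) for c, r, a in zip(CODONS, reference, code) if r != a}


def stop_flags(code):
    """For each stop codon probability, tell whether that codon is a stop."""
    stops = set(stop_codons(code))
    return {"amberprob": "tag" in stops,
            "ochreprob": "taa" in stops,
            "opalprob": "tga" in stops}


print(stop_codons(TABLE_6))   # ['tga']
print(differences(TABLE_6))   # {'taa': ('*', 'Q'), 'tag': ('*', 'Q')}
```

For table 6 only opalprob belongs to a stop codon, which matches the example setting in which amberprob and ochreprob are 0.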