Using BRAKER2 and AUGUSTUS

This page describes how to use BRAKER2 and AUGUSTUS for training of AUGUSTUS and for gene prediction.

General remarks

This tutorial is designed in a way that persons with no experience in Linux should be able to follow. In case someone gets bored, he or she can try to get the code of this tutorial on his own laptop running.
Here some manuals:

0. Preparation

Introduction to Linux

1. Navigate to the directory /home/mstanke on the genotoul server and display the content of the file .gm_key on your screen.

Helpful commands:

ssh Connect to a server via SSH. The command ssh genotoul.toulouse.inra.fr -l user_name should connect you to the genotoul server if you fill in your user name for user_name
pwd Print working directory. Prints out the absolute path of the directory where you are right now.
cd Change directory
- cd work Change to the directory work given that it is a sub-directory of your current directory.
- cd / Change to the root directory.
- cd ~ Change to your home directory.
- cd .. Change to the parent directory of your current directory.
- cd /usr Change to the directory usr which is located in the root directory.
ls List directory. Lists the content of your current directory.
- ls List the content of your current directory.
- ls work List the content of the directory work given it is a sub-directory of your current directory.
- ls -l Lists additional information about the content of your current directory
- ls -a ~ List all files in your home directory, including hidden ones, beginning with a dot.
- ll Short form for ls -l
man Display reference manual of a command. Try e.g. man ls
less Display a given file, allowing to navigate it. Try e.g. less .bashrc

2. Create a directory test in the directory work of your home directory. Create a file test.txt in this newly created directory. Edit this file using a text editor, i.e., write something into this file. Then delete the directory test including test.txt.

Helpful commands:

mkdir Make directory. Creates the given directory. E.g. mkdir abc creates the directory acb as a sub-directory of your current directory.
touch Change a file timestamp. The command touch file.txt changes the timestamp of a file file.txt to the current time and date. If file.txt does not exist yet, it is created.
emacs Opens the editor emacs.
- emacs -nw file.txt Opens the file file.txt with emacs in command line mode. That is, if you drop the option -nw, emacs is opened as a GUI, which does not work if you connected to a server using ssh without the option -X.
- Ctrl-x Ctrl-s Save the current file.
- Ctrl-x Ctrl-c Close the emacs.
rm Remove file / directory.
- rm file.txt Removes the file file.txt in your current directory
- rm dir Removes the directory dir given it is a sub-directory of your current directory and dir is empty.
- rm -r dir Removes the directory dir even it is not empty.

3. Execute the following commands

/usr/local/bioinfo/src/augustus/augustus-3.2.3/bin/augustus --version
augustus --version
./augustus --version
cd /usr/local/bioinfo/src/augustus/augustus-3.2.3/bin/; ./augustus --version

These commands do the following:

Execute augustus (with the option --version) using the executable augustus in /usr/local/bioinfo/src/augustus/augustus-3.2.3/bin
Execute augustus using the executable which linux finds by itself using the environment variable $PATH
Execute augustus using the executable augustus in the current directory. This fails because there is no executable augustus in the current directory.
Change the directory to cd /usr/local/bioinfo/src/augustus/augustus-3.2.3/bin and then execute augustus using the executable augustus in the current directory.

You can find out which file is executed with a given command using which:

which augustus
# /usr/local/bioinfo/src/augustus/current/bin/augustus

which exonerate
# /usr/local/bioinfo/bin/exonerate

Specific preparation

1. Copy the directory /home/mstanke/work/tutorial into your directory ~/work

Helpful commands:

cp Copy files / directory
- cp src.txt dest.txt Copies the file cp src.txt onto the file dest.txt
- cp -r src dest Copies the directory src including its entire content onto the directory dest

If not stated differently, it is assumed that the commands provided in this tutorial are executed in the directory ~/work/tutuorial.

2. To improve readability of the shell commands that we will use subsequently, generate the following symbolic links in the folder tutorial/data:

A link genome.fa for the file RCC4221_Ot_outlier.fasta
A link proteins.fa for the file osttaV2_active_pep_2015_ch02_ch19.tfa

Check whether your commands worked using less.

Helpful commands:

ln Make link between files. The command ln -s file.txt link.txt creates a symbolic link link.txt for the file file.txt. That is, you can use link.txt everywhere instead of file.txt.

3. Export some additional folders to the PATH variable. This has to be repeated if you close and re-open your terminal.

export PATH=$PATH:/usr/local/bioinfo/src/RepeatScout/current:/usr/local/bioinfo/src/GeneMark-ET/current/gmes_petap

1. Repeat-mask the genome

For

the prediction of genes using AUGUSTUS
the alignment of RNA-seq reads to the genome

we need to mask the repeats in the target genome. This involves several steps, in which we employ RepeatScout and RepeatMasker.

First, we compile a repeat library using RepeatScout.

mkdir -p masking
build_lmer_table -sequence data/genome.fa -freq masking/genome.freq
RepeatScout -sequence data/genome.fa -output masking/genome.repseq.fa -freq masking/genome.freq   # takes ~30s

The file masking/genome.freq contains a list of ostensible repeat sequences.

head -n 100 masking/genome.freq

AAAAAGTGTGAA	4	351047
AAAACATGTGAA	8	758672
AAAAAACCTGAA	5	1473666
AAAAAGTATGAA	4	1470478
AAAACTGATGAA	3	336101
AAAACTACTTTT	5	1478803
AAAAGTTATGAA	7	822634
AAAATCGCTGAA	3	819533
AAAACTTTGGAA	3	1424203
AAAATACATTTT	3	504310
...

If you want to have a look at the entire file, you can use

less masking/genome.freq

The repeat library generated by RepeatScout is stored in masking/genome.repseq.fa

>R=0
GCGCGCGCGCGCGTAGGCCACGCGAACGCCTCCACACGCGTCGTCGCGCTGTCACGGTGGGCAGTACACGTCGCCACGGG
CGCGTGATGATACGTGCGAGTGCGCGACGTAGAAATACGTGCGCACGCGACTCACGTCGGGTCACCCAAAGCTCGCGCGC
TTGGCCGCGTCGCCTCGCGTCCATCCCCAGGATGCGCTCGTCGTGACGCACACGCTGTCGAGACGTATTTGAAGTGCGTT
TTCACAGACGATCTGGTACGTGTGATGGACGAACGTCGCATAGGACACAGTGTCGGCTATGAAACACGCCCAGCGGCGAT
GCGGCGTCGTTCCGCGCGCGTCCATTTTGATGGGCGCGCGCGGGCGTCACGTCTGGAAACTGCGTGCGCGCGTCGCTCGA
...
>R=1
GGATTTTGACTAATCTCGGTGTTGGTTGTGCACACCTTTTCTTGAGTCATAATCCGTCACGTCATGTATGGGTAAACAGT
ATGATCGGGGTACTCCAACTTGTGTTTTACCTTGTACAAATTTCGGCAAGTTTTCGCAACACGCGACATGTGTGCGCGGA
TAACGCGCCGACGAGGCAGCGAGGACGTGCGCAGCCATGAACAGTTGTAGATACATCCATGTTACATCAGATTCATCCAG
TTTCTGTGGCGTACGACGGACATAGCGACGTCCTCACGCAGTTCCTCGCGCTCTGGTACTCCACGCCAAGCTCTCAAGAA
ACCACGTCTAAACCCTACCTAAACCCTACCTAAACCCTAGTTAGGGTTTATGTCACATGTACACCACGTGCACTACGCGC
...

Next, we remove repeats detected by RepeatScout that are too short or of low complexity.

cat masking/genome.repseq.fa | filter-stage-1.prl > masking/genome.repseq.f1.fa

Comparing the repeat library before and after the execution of filter-stage-1.prl shows that three repeats has been removed.

diff masking/genome.repseq.fa masking/genome.repseq.f1.fa

1c1
< >R=0
---
> >R=0 (RR=1.  TRF=0.015 NSEG=0.009)
169c169
< >R=1
---
> >R=1 (RR=2.  TRF=0.087 NSEG=0.000)

...

1087,1089c1087
< >R=82
< CGTATTATATCACTCATTGGAGCATAACGATACATATTACATCTGAACAC
< >R=83
---
> >R=83 (RR=83.  TRF=0.000 NSEG=0.000)

This can be confirmed with grep.

grep -c ">" masking/genome.repseq.fa masking/genome.repseq.f1.fa
# masking/genome.repseq.fa:156
# masking/genome.repseq.f1.fa:153

Next, we mask the genome with RepeatMasker, using the repeat library generated by RepeatScout.

RepeatMasker data/genome.fa -e ncbi -lib masking/genome.repseq.f1.fa -dir masking   # takes  ~1m

The file masking/genome.fa.masked contains the masked genome.

>ch02_Contig_2T
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCGACAAGAGACG
CACACAATGACAGACCCCGGCCAACATTGAGGGTGTACCAAGGCGCACAC
CACAAAACGAGGCACTAAGACACGTCTGGGGGAACCCTTATTCAAGAAAT
...

Then, we filter out repeats which do not occur often enough.

cat masking/genome.repseq.f1.fa | filter-stage-2.prl --cat=masking/genome.fa.out > masking/genome.repseq.f2.fa

grep -c ">" masking/genome.repseq.fa masking/genome.repseq.f1.fa masking/genome.repseq.f2.fa
# masking/genome.repseq.fa:156
# masking/genome.repseq.f1.fa:153
# masking/genome.repseq.f2.fa:36

Finally, we run RepeatMasker using the final repeat library. Since the alignment tool STAR needs a hard-masked genome, whereas AUGUSTUS and BRAKER need it soft-masked, we generate both a soft- and a hard-masked genome.

RepeatMasker data/genome.fa -e ncbi -lib masking/genome.repseq.f2.fa -dir masking   # takes ~1m
mv masking/genome.fa.masked masking/genome.fa.hardmasked
RepeatMasker data/genome.fa -e ncbi -lib masking/genome.repseq.f2.fa -dir masking -xsmall  # takes ~1m
mv masking/genome.fa.masked masking/genome.fa.softmasked

2. Alignment of RNA-seq reads

We use the first ??? reads from the RNA-seq library with the SRA accession number SRX3142169 for the gene prediction (run SRR5986291). As a preparatory step, we need to align these RNA-seq reads to our target genome. For this purpose, we use STAR. The application of STAR comprises two steps: First, the genome to which the reads will be aligned to is processed. Second, the reads are aligned and the result is output as a sorted BAM file.

mkdir -p star
STAR --runMode genomeGenerate --genomeDir star --genomeFastaFiles masking/genome.fa.hardmasked --genomeSAindexNbases 9 
STAR --genomeDir star --readFilesIn data/rna-seq.fa --outFileNamePrefix star/ --outSAMtype BAM SortedByCoordinate   # takes ~12m
mv star/Aligned.sortedByCoord.out.bam star/rna-seq.bam

You can have a look at the result with the command

samtools view star/rna-seq.bam | less

3. Gene prediction

As preparation, we carry out three steps. First, we set two environment variables. This has to be repeated if you close and re-open your terminal.

export GENEMARK_PATH=/usr/local/bioinfo/src/GeneMark-ET/current/gmes_petap
export BAMTOOLS_PATH=/usr/local/bioinfo/bin

Second, in order to use GeneMark a valid key file has to be put into the home directory. and copy configuration data of Augustus into the home directory.

cp /home/mstanke/.gm_key ~/.gm_key 
mkdir -p /work/tutorial/augustus
cp /home/mstanke/work/tutorial/augustus/config ~/work/tutorial/augustus/config

Finally, we make the species directory writable to everyone so that BRAKER and AUGUSTUS can write into them. Moreover, we remove two directories which if existed would lead to BRAKER and AUGUSTUS aborting once called. It is important to recall to remove these two directories when havnig to repeat the first call of BRAKER below in the situation that previous calls were not successful.

chmod a+w -R augustus/config/species 
rm -rf augustus/config/species/o.tauri
rm -rf braker

Now, we can begin with the gene prediction itself. For this purpose, we use BRAKER and AUGUSTUS.

The effect of our first call of BRAKER is trifold:

Generate a set of training genes, based on the RNA-seq data we prepared, using GeneMark
Train AUGUSTUS with these training genes
Predict genes using the RNA-seq data as hints using AUGUSTUS

/home/mstanke/work/BRAKER_v2.0.4+/braker.pl --species=o.tauri --genome=masking/genome.fa.softmasked --bam=star/rna-seq.bam --softmasking --skipOptimize --cores 1 \
--AUGUSTUS_SCRIPTS_PATH=/usr/local/bioinfo/src/augustus/augustus-3.2.3/scripts --AUGUSTUS_BIN_PATH=/usr/local/bioinfo/src/augustus/augustus-3.2.3/bin \
--AUGUSTUS_CONFIG_PATH=/home/[user name]/work/tutorial/augustus/config   # takes ~ 7m

The options in the BRAKE call have the follownig meanings:

--species: The name of the species on which we carry out our gene prediction.
The parameters AUGUSTUS infers from the training data and uses to parametrize its internal model for the gene prediction is located in AUGUSTUS_CONFIG_PATH/species/[value of --species]. The term "species" is used in this option although AUGUSTUS can also be used for gene predictions on different strains of the same species.
--genome: The genomic data on which the gene prediction is carried out.
--bam: The BAM file containign the aligned RNA-seq reads used for inference of the training genes by GeneMark and as hints for the gene prediction by AUGUSTUS
--softmasking: Flag indicating that the genome is softmasked.
It is possible to use a hardmasked genome but this leads to an inferior performance of AUGUSTUS.
--skipOptimize Flag indicating that AUGUSTUS should not try to optimize the parameters it derives from the training genes.
One should not use the parameters AUGUSTUS infers without optimization to generate a gene set used in a scientific project. We skip optimization here because it is very time-consuming
--cores: The number of cores used by BRAKER and AUGUSTUS.
To reduce running time, once can increase the number of cores. Nevertheless, for this session this is probably not advisable as the genotoul server only has 32 cores.
--AUGUSTUS_SCRIPTS_PATH, --AUGUSTUS_BIN_PATH, --AUGUSTUS_CONFIG_PATH: These paths specify where to find various files and executables.
When you install BRAKER and AUGUSTUS on a computer for which you have administrator rights, you very probably will not need to set these paths.

You can determine the number of training genes generated by BRAKER using the command

grep -c LOCUS braker/o.tauri/genbank.good.gb

We now carry out two further gene predictions, using hints differing form the ones just used in the prediction step. Nevertheless, to inspect the results just obtained, it is advisable to proceed to the next section and first visualize the results and then come back later to this section.

Next, we want to carry out a gene prediction not only using RNA-seq data as hints but also protein data. Since we have already trained AUGUSTUS in the first run of BRAKER, we can now reuse the output from the first run.

/home/mstanke/work/BRAKER_v2.0.4+/braker.pl --species=o.tauri --genome=masking/genome.fa.softmasked --hints=braker/o.tauri/hintsfile.gff --prot_seq=data/proteins.fa \
--prg=exonerate --softmasking --skipOptimize --skipAllTraining --cores 1 --workingdir braker_proteins \
--AUGUSTUS_SCRIPTS_PATH=/usr/local/bioinfo/src/augustus/augustus-3.2.3/scripts --AUGUSTUS_BIN_PATH=/usr/local/bioinfo/src/augustus/augustus-3.2.3/bin \
--AUGUSTUS_CONFIG_PATH=/home/[user name]/work/tutorial/augustus/config   # takes ~11m

The following options were not used in the first run:

--hints: In the first run we provided a BAM file as hints. This file was processed and converted into a hint file. Instead of passing the BAM file to BRAKER we therefore now can usen the hint file.
--prot_seq: The protein data used as hints by AUGUSTUS.
--prg: The program used to algin the proteins.
We use exonerate here out of convenience but GenomeThreader should be chosen for research projects.
--skipAllTraining: No training is done by GeneMark or AUGUSTUS.
--workingdir: The output directory used by BRAKER.
Since we do not want to override the results from the first run of BRAKER, we specify the output directory for this run.

Finally, we carry out an ab initio gene prediction using AUGUSTUS directly.

augustus --species=o.tauri masking/genome.fa.softmasked > augustus/augustus.abinitio.gff --AUGUSTUS_CONFIG_PATH=/home/[user name]/work/tutorial/augustus/config   # takes ~2m

Since we specify the species as o.tauri, AUGUSTUS automatically retrieves the parameter we estimated in the first run of BRAKER and uses them to carry out the gene prediction.

4. Visualization using UCSC browser

The results of the gene prediction can be displayed via an assembly hub of the UCSC browser. Go to https://genome-euro.ucsc.edu/cgi-bin/hgHubConnect, select the "My Hubs" tab, enter http://bioinf.uni-greifswald.de/bioinf/otauri/hub/hub.txt into the field "URL" and click "Add Hub". Once you habe been redirected to Genome Browser Gateway site, click "GO" in the top part of the page. This will show you the RNA-Seq coverage and, if you zoom in far enough, the nucleotides of the genome as well as their translation in all frames.

To display the results of the gene prediction and the data we used for it, click on "add custom track". Then click on the upper "Browse..." and click "Submit". Then click on the link "User Track" in the table and enter a meaningful name instead of "User Track". Click again "Submit" and then click "Go".

The GFF files to be displayed in the UCSC genome browser can be found here:

5. Miscellanous

An AUGUSTUS service allowing for parameter training and gene prediction is available on http://bioinf.uni-greifswald.de/webaugustus/.

The version of BRAKER used in this tutorial has been modified in a way that it is possible to use it with rna-seq data too small to obtain a gene prediction of reasonable quality. If a release version of BRAKER is applied to the tutorial data, this probably will result in BRAKER aborting with an error message.

The presentation Ingo Bulla gave during the morning session on 13 February 2018 in Banyuls can be found here.