Using BRAKER2 and AUGUSTUS

This page describes how to use BRAKER2 and AUGUSTUS for training of AUGUSTUS and for gene prediction.

General remarks


0. Preparation

Introduction to Linux

1. Navigate to the directory /home/mstanke on the genotoul server and display the content of the file .gm_key on your screen.

Helpful commands:
2. Create a directory test in the directory work of your home directory. Create a file test.txt in this newly created directory. Edit this file using a text editor, i.e., write something into this file. Then delete the directory test including test.txt.

Helpful commands:

3. Execute the following commands
/usr/local/bioinfo/src/augustus/augustus-3.2.3/bin/augustus --version
augustus --version
./augustus --version
cd /usr/local/bioinfo/src/augustus/augustus-3.2.3/bin/; ./augustus --version
These commands do the following:
  1. Execute augustus (with the option --version) using the executable augustus in /usr/local/bioinfo/src/augustus/augustus-3.2.3/bin
  2. Execute augustus using the executable which linux finds by itself using the environment variable $PATH
  3. Execute augustus using the executable augustus in the current directory. This fails because there is no executable augustus in the current directory.
  4. Change the directory to cd /usr/local/bioinfo/src/augustus/augustus-3.2.3/bin and then execute augustus using the executable augustus in the current directory.
You can find out which file is executed with a given command using which:
which augustus
# /usr/local/bioinfo/src/augustus/current/bin/augustus

which exonerate
# /usr/local/bioinfo/bin/exonerate

Specific preparation

1. Copy the directory /home/mstanke/work/tutorial into your directory ~/work

Helpful commands: If not stated differently, it is assumed that the commands provided in this tutorial are executed in the directory ~/work/tutuorial.

2. To improve readability of the shell commands that we will use subsequently, generate the following symbolic links in the folder tutorial/data: Check whether your commands worked using less.

Helpful commands:
3. Export some additional folders to the PATH variable. This has to be repeated if you close and re-open your terminal.
export PATH=$PATH:/usr/local/bioinfo/src/RepeatScout/current:/usr/local/bioinfo/src/GeneMark-ET/current/gmes_petap


1. Repeat-mask the genome

For we need to mask the repeats in the target genome. This involves several steps, in which we employ RepeatScout and RepeatMasker.

First, we compile a repeat library using RepeatScout.
mkdir -p masking
build_lmer_table -sequence data/genome.fa -freq masking/genome.freq
RepeatScout -sequence data/genome.fa -output masking/genome.repseq.fa -freq masking/genome.freq   # takes ~30s
The file masking/genome.freq contains a list of ostensible repeat sequences.
head -n 100 masking/genome.freq
AAAAAGTGTGAA	4	351047
AAAACATGTGAA	8	758672
AAAAAACCTGAA	5	1473666
AAAAAGTATGAA	4	1470478
AAAACTGATGAA	3	336101
AAAACTACTTTT	5	1478803
AAAAGTTATGAA	7	822634
AAAATCGCTGAA	3	819533
AAAACTTTGGAA	3	1424203
AAAATACATTTT	3	504310
...
If you want to have a look at the entire file, you can use
less masking/genome.freq
The repeat library generated by RepeatScout is stored in masking/genome.repseq.fa
>R=0
GCGCGCGCGCGCGTAGGCCACGCGAACGCCTCCACACGCGTCGTCGCGCTGTCACGGTGGGCAGTACACGTCGCCACGGG
CGCGTGATGATACGTGCGAGTGCGCGACGTAGAAATACGTGCGCACGCGACTCACGTCGGGTCACCCAAAGCTCGCGCGC
TTGGCCGCGTCGCCTCGCGTCCATCCCCAGGATGCGCTCGTCGTGACGCACACGCTGTCGAGACGTATTTGAAGTGCGTT
TTCACAGACGATCTGGTACGTGTGATGGACGAACGTCGCATAGGACACAGTGTCGGCTATGAAACACGCCCAGCGGCGAT
GCGGCGTCGTTCCGCGCGCGTCCATTTTGATGGGCGCGCGCGGGCGTCACGTCTGGAAACTGCGTGCGCGCGTCGCTCGA
...
>R=1
GGATTTTGACTAATCTCGGTGTTGGTTGTGCACACCTTTTCTTGAGTCATAATCCGTCACGTCATGTATGGGTAAACAGT
ATGATCGGGGTACTCCAACTTGTGTTTTACCTTGTACAAATTTCGGCAAGTTTTCGCAACACGCGACATGTGTGCGCGGA
TAACGCGCCGACGAGGCAGCGAGGACGTGCGCAGCCATGAACAGTTGTAGATACATCCATGTTACATCAGATTCATCCAG
TTTCTGTGGCGTACGACGGACATAGCGACGTCCTCACGCAGTTCCTCGCGCTCTGGTACTCCACGCCAAGCTCTCAAGAA
ACCACGTCTAAACCCTACCTAAACCCTACCTAAACCCTAGTTAGGGTTTATGTCACATGTACACCACGTGCACTACGCGC
...


Next, we remove repeats detected by RepeatScout that are too short or of low complexity.
cat masking/genome.repseq.fa | filter-stage-1.prl > masking/genome.repseq.f1.fa
Comparing the repeat library before and after the execution of filter-stage-1.prl shows that three repeats has been removed.
diff masking/genome.repseq.fa masking/genome.repseq.f1.fa
1c1
< >R=0
---
> >R=0 (RR=1.  TRF=0.015 NSEG=0.009)
169c169
< >R=1
---
> >R=1 (RR=2.  TRF=0.087 NSEG=0.000)

...

1087,1089c1087
< >R=82
< CGTATTATATCACTCATTGGAGCATAACGATACATATTACATCTGAACAC
< >R=83
---
> >R=83 (RR=83.  TRF=0.000 NSEG=0.000)
This can be confirmed with grep.
grep -c ">" masking/genome.repseq.fa masking/genome.repseq.f1.fa
# masking/genome.repseq.fa:156
# masking/genome.repseq.f1.fa:153


Next, we mask the genome with RepeatMasker, using the repeat library generated by RepeatScout.
RepeatMasker data/genome.fa -e ncbi -lib masking/genome.repseq.f1.fa -dir masking   # takes  ~1m
The file masking/genome.fa.masked contains the masked genome.
>ch02_Contig_2T
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCGACAAGAGACG
CACACAATGACAGACCCCGGCCAACATTGAGGGTGTACCAAGGCGCACAC
CACAAAACGAGGCACTAAGACACGTCTGGGGGAACCCTTATTCAAGAAAT
...


Then, we filter out repeats which do not occur often enough.
cat masking/genome.repseq.f1.fa | filter-stage-2.prl --cat=masking/genome.fa.out > masking/genome.repseq.f2.fa
grep -c ">" masking/genome.repseq.fa masking/genome.repseq.f1.fa masking/genome.repseq.f2.fa
# masking/genome.repseq.fa:156
# masking/genome.repseq.f1.fa:153
# masking/genome.repseq.f2.fa:36


Finally, we run RepeatMasker using the final repeat library. Since the alignment tool STAR needs a hard-masked genome, whereas AUGUSTUS and BRAKER need it soft-masked, we generate both a soft- and a hard-masked genome.
RepeatMasker data/genome.fa -e ncbi -lib masking/genome.repseq.f2.fa -dir masking   # takes ~1m
mv masking/genome.fa.masked masking/genome.fa.hardmasked
RepeatMasker data/genome.fa -e ncbi -lib masking/genome.repseq.f2.fa -dir masking -xsmall  # takes ~1m
mv masking/genome.fa.masked masking/genome.fa.softmasked


2. Alignment of RNA-seq reads

We use the first ??? reads from the RNA-seq library with the SRA accession number SRX3142169 for the gene prediction (run SRR5986291). As a preparatory step, we need to align these RNA-seq reads to our target genome. For this purpose, we use STAR. The application of STAR comprises two steps: First, the genome to which the reads will be aligned to is processed. Second, the reads are aligned and the result is output as a sorted BAM file.
mkdir -p star
STAR --runMode genomeGenerate --genomeDir star --genomeFastaFiles masking/genome.fa.hardmasked --genomeSAindexNbases 9 
STAR --genomeDir star --readFilesIn data/rna-seq.fa --outFileNamePrefix star/ --outSAMtype BAM SortedByCoordinate   # takes ~12m
mv star/Aligned.sortedByCoord.out.bam star/rna-seq.bam
You can have a look at the result with the command
samtools view star/rna-seq.bam | less


3. Gene prediction

As preparation, we carry out three steps. First, we set two environment variables. This has to be repeated if you close and re-open your terminal.
export GENEMARK_PATH=/usr/local/bioinfo/src/GeneMark-ET/current/gmes_petap
export BAMTOOLS_PATH=/usr/local/bioinfo/bin
Second, in order to use GeneMark a valid key file has to be put into the home directory. and copy configuration data of Augustus into the home directory.
cp /home/mstanke/.gm_key ~/.gm_key 
mkdir -p /work/tutorial/augustus
cp /home/mstanke/work/tutorial/augustus/config ~/work/tutorial/augustus/config
Finally, we make the species directory writable to everyone so that BRAKER and AUGUSTUS can write into them. Moreover, we remove two directories which if existed would lead to BRAKER and AUGUSTUS aborting once called. It is important to recall to remove these two directories when havnig to repeat the first call of BRAKER below in the situation that previous calls were not successful.
chmod a+w -R augustus/config/species 
rm -rf augustus/config/species/o.tauri
rm -rf braker


Now, we can begin with the gene prediction itself. For this purpose, we use BRAKER and AUGUSTUS.

The effect of our first call of BRAKER is trifold:
/home/mstanke/work/BRAKER_v2.0.4+/braker.pl --species=o.tauri --genome=masking/genome.fa.softmasked --bam=star/rna-seq.bam --softmasking --skipOptimize --cores 1 \
--AUGUSTUS_SCRIPTS_PATH=/usr/local/bioinfo/src/augustus/augustus-3.2.3/scripts --AUGUSTUS_BIN_PATH=/usr/local/bioinfo/src/augustus/augustus-3.2.3/bin \
--AUGUSTUS_CONFIG_PATH=/home/[user name]/work/tutorial/augustus/config   # takes ~ 7m
The options in the BRAKE call have the follownig meanings:
You can determine the number of training genes generated by BRAKER using the command
grep -c LOCUS braker/o.tauri/genbank.good.gb	


We now carry out two further gene predictions, using hints differing form the ones just used in the prediction step. Nevertheless, to inspect the results just obtained, it is advisable to proceed to the next section and first visualize the results and then come back later to this section.

Next, we want to carry out a gene prediction not only using RNA-seq data as hints but also protein data. Since we have already trained AUGUSTUS in the first run of BRAKER, we can now reuse the output from the first run.
/home/mstanke/work/BRAKER_v2.0.4+/braker.pl --species=o.tauri --genome=masking/genome.fa.softmasked --hints=braker/o.tauri/hintsfile.gff --prot_seq=data/proteins.fa \
--prg=exonerate --softmasking --skipOptimize --skipAllTraining --cores 1 --workingdir braker_proteins \
--AUGUSTUS_SCRIPTS_PATH=/usr/local/bioinfo/src/augustus/augustus-3.2.3/scripts --AUGUSTUS_BIN_PATH=/usr/local/bioinfo/src/augustus/augustus-3.2.3/bin \
--AUGUSTUS_CONFIG_PATH=/home/[user name]/work/tutorial/augustus/config   # takes ~11m	
The following options were not used in the first run:
Finally, we carry out an ab initio gene prediction using AUGUSTUS directly.
augustus --species=o.tauri masking/genome.fa.softmasked > augustus/augustus.abinitio.gff --AUGUSTUS_CONFIG_PATH=/home/[user name]/work/tutorial/augustus/config   # takes ~2m
Since we specify the species as o.tauri, AUGUSTUS automatically retrieves the parameter we estimated in the first run of BRAKER and uses them to carry out the gene prediction.


4. Visualization using UCSC browser

The results of the gene prediction can be displayed via an assembly hub of the UCSC browser. Go to https://genome-euro.ucsc.edu/cgi-bin/hgHubConnect, select the "My Hubs" tab, enter http://bioinf.uni-greifswald.de/bioinf/otauri/hub/hub.txt into the field "URL" and click "Add Hub". Once you habe been redirected to Genome Browser Gateway site, click "GO" in the top part of the page. This will show you the RNA-Seq coverage and, if you zoom in far enough, the nucleotides of the genome as well as their translation in all frames.

To display the results of the gene prediction and the data we used for it, click on "add custom track". Then click on the upper "Browse..." and click "Submit". Then click on the link "User Track" in the table and enter a meaningful name instead of "User Track". Click again "Submit" and then click "Go".

The GFF files to be displayed in the UCSC genome browser can be found here:

5. Miscellanous

An AUGUSTUS service allowing for parameter training and gene prediction is available on http://bioinf.uni-greifswald.de/webaugustus/.

The version of BRAKER used in this tutorial has been modified in a way that it is possible to use it with rna-seq data too small to obtain a gene prediction of reasonable quality. If a release version of BRAKER is applied to the tutorial data, this probably will result in BRAKER aborting with an error message.

The presentation Ingo Bulla gave during the morning session on 13 February 2018 in Banyuls can be found here.