augustus --species=fly --predictionStart=7000001 --predictionEnd=7500000 chr2R.fa > augustus.abinitio.gff # takes ~1mIn this example, I am using the fly parameters for comparability whith predictions below. Of course, the self-trained bug parameters also work. The output file augustus.abinitio.gff now contains the predicted gene structures in GFF format with additional comments (lines starting with #).
# This output was generated with AUGUSTUS (version 2.5). ... # start gene g1 chr2R AUGUSTUS gene 7007533 7010935 0.02 - . g1 chr2R AUGUSTUS transcript 7007533 7010935 0.02 - . g1.t1 chr2R AUGUSTUS tts 7007533 7007533 . - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS exon 7007533 7008630 . - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS stop_codon 7007610 7007612 . - 0 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS intron 7008631 7008694 1 - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS intron 7008812 7008865 0.88 - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS intron 7009192 7009251 0.95 - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS CDS 7007610 7008630 1 - 1 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS CDS 7008695 7008811 0.88 - 1 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS exon 7008695 7008811 . - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS CDS 7008866 7009191 0.99 - 0 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS exon 7008866 7009191 . - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS CDS 7009252 7009353 0.95 - 0 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS exon 7009252 7009429 . - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS start_codon 7009351 7009353 . - 0 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS exon 7010820 7010935 . - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS tss 7010935 7010935 . - . transcript_id "g1.t1"; gene_id "g1"; # protein sequence = [MNTLSSARSVAIYVGPVRSSRSASVLAHEQAKSSITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKR # YGDIYVMPGMFGRKDWVTTFNTKDIEMVFRNEGIWPRRDGLDSIVYFREHVRPDVYGEVQGLVASQNEAWGKLRSAINPIFMQPRGLRMYYEPLSNIN # NEFIERIKEIRDPKTLEVPEDFTDEISRLVFESLGLVAFDRQMGLIRKNRDNSDALTLFQTSRDIFRLTFKLDIQPSMWKIISTPTYRKMKRTLNDSL # NVAQKMLKENQDALEKRRQAGEKINSNSMLERLMEIDPKVAVIMSLDILFAGVDATATLLSAVLLCLSKHPDKQAKLREELLSIMPTKDSLLNEENMK # DMPYLRAVIKETLRYYPNGLGTMRTCQNDVILSGYRVPKGTTVLLGSNVLMKEATYYPRPDEFLPERWLRDPETGKKMQVSPFTFLPFGFGPRMCIGK # RVVDLEMETTVAKLIRNFHVEFNRDASRPFKTMFVMEPAITFPFKFTDIEQ] # end gene g1 ...For a description of the GFF format see the GFF definition at the Sanger Centre.
getAnnoFasta.pl augustus.abinitio.gffwhich extracts the peptide sequences into a file augustus.abinitio.aa:
>g1.t1 MNKLNLVLITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKRYGDIYVMPGMFGRKDWVTTFNTKDIEMVFRNEGIWPRRDGLDSIVYFR EHVRPDVYGEVQGLVASQNEAWGKLRSAINPIFMQPRGLRMYYEPLSNINNEFIERIKEIRDPKTLEVPEDFTDEISRLVFESLGLVAFDRQMGLIRKNR DNSDALTLFQTSRDIFRLTFKLDIQPSMWKIISTPTYRKMKRTLNDSLNVAQKMLKENQDALEKRRQAGEKINSNSMLERLMEIDPKVAVIMSLDILFAG VDATATLLSAVLLCLSKHPDKQAKLREELLSIMPTKDSLLNEENMKDMPYLRAVIKETLRYYPNGLGTMRTCQNDVILSGYRVPKGTTVLLGSNVLMKEA TYYPRPDEFLPERWLRDPETGKKMQVSPFTFLPFGFGPRMCIGKRVVDLEMETTVAKLIRNFHVEFNRDASRPFKTMFVMEPAITFPFKFTDIEQ >g2.t1 MRHRNKGAVKRKGPSAGAEQEQELKKPKSEFSNGFKRYITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKRYGDIYVMPGMFGRKDWVT ...
echo -e "browser position chr2R:7000000-7050000\n\ browser hide multiz15way bdtnpChipper\n\ track name=abinitio description=\"Augustus ab initio predictions\" db=dm3 visibility=3" > abinitio.browser grep -P "AUGUSTUS\tCDS" augustus.abinitio.gff >> abinitio.browserWith the grep command we just filtered out the lines that specify the coding exon coordinates.
browser position chr2R:7000000-7050000 browser hide multiz15way bdtnpChipper track name=abinitio description="Augustus ab initio predictions" db=dm3 visibility=3 chr2R AUGUSTUS CDS 7007610 7008630 1 - 1 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS CDS 7008695 7008811 0.88 - 1 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS CDS 7008866 7009191 0.99 - 0 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS CDS 7009252 7009353 0.95 - 0 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS CDS 7011376 7012396 1 - 1 transcript_id "g2.t1"; gene_id "g2"; chr2R AUGUSTUS CDS 7012461 7012577 0.92 - 1 transcript_id "g2.t1"; gene_id "g2"; chr2R AUGUSTUS CDS 7012632 7012957 0.99 - 0 transcript_id "g2.t1"; gene_id "g2"; ...
Now upload this file as a custom track on the UCSC genome browser.
[+] How again?Hints are extrinsic evidence about the location and struture of genes in a particular GFF format. When predicting genes AUGUSTUS can incorporate these hints, which will change the likelihood of gene structures candidates. It will tend to predict gene structures that are in agreement with the hints.
ESTs or mRNAs | transcriptome reads, long enough to span several exons (454, Sanger) |
RNA-Seq | high coverage short read transcriptome sequences (Illumina, SOLiD) |
genomic conservation | |
MSMS |
Below, we will practice the preparation of hints from ESTs or mRNA (3.1) and from RNA-Seq (3.2).
Align the ESTs against chr2R using BLAT.
blat -noHead chr2R.fa est.chr2R.7M-8M.fa est.psl # takes ~3mThis creates an alignment file est.psl in PSL format:
440 5 0 2 0 0 1 1281 - gi|1703783 447 0 447 chr2R 21146708 7776697 7778425 2 197,250, 0,197, 7776697,7778175, 494 3 0 1 0 0 2 65 + gi|1703784 498 0 498 chr2R 21146708 7775550 7776113 3 452,12,34, 0,452,464,x 7775550,7776003,7776079, ...However, typically some ESTs align well to very many places in the genome. BLAT also includes short local alignments starting from 30bp. For this reason, we further filter the alignments:
cat est.psl | filterPSL.pl --best --minCover=80 > est.f.pslest.f.psl now only contains for each query the best alignment(s) and that only if it covers at least 80% of the query length. This reduces the number of alignments:
wc -l est.psl est.f.psl # 10487 est.psl # 8606 est.f.pslWe can have a look at those EST alignments by creating another custom browser track:
echo -e "browser position chr2R:7000000-7050000\n\ track name=ESTs description=\"EST alignments\" db=dm3 visibility=4" > ests.browser cat est.f.psl >> ests.browser gzip ests.browser
You can now upload ests.browser.gz as another custom track or click on this prepared custom track
Next, generate hints from the EST alignments:blat2hints.pl --nomult --in=est.f.psl --out=hints.est.gffThe script blat2hints.pl identifies the positions of likely introns, exons and exonic regions (termed exonpart or ep) from the alignments. The file hints.est.gff now contains these hints sorted by third column. However, they are internally grouped by the the group name following grp= in the last column. An example group of hints may look like this.
grep 15058069 hints.est.gffyields
chr2R b2h ep 7559574 7559803 0 . . grp=gi|15058069;pri=4;src=E chr2R b2h ep 7560550 7560814 0 . . grp=gi|15058069;pri=4;src=E chr2R b2h exon 7560222 7560347 0 . . grp=gi|15058069;pri=4;src=E chr2R b2h intron 7559804 7560221 0 . . grp=gi|15058069;pri=4;src=E chr2R b2h intron 7560348 7560549 0 . . grp=gi|15058069;pri=4;src=E
Upload the file chr2R.7M-8M.wig as a UCSC custom track (gziping would speed up upload) or click on this prepared custom track.
Generate hints about exonic regions from the coverage graph (wig file):
cat chr2R.7M-8M.wig | wig2hints.pl --width=10 --margin=10 --minthresh=2 --minscore=4 \ --src=W --type=ep --radius=4.5 > hints.rnaseq.ep.gffThe resulting hints.rnaseq.ep.gff now contains hints in a GFF format that augustus understands:
chr2R w2h ep 7007551 7007560 5.300 . . src=W;mult=5; chr2R w2h ep 7007561 7007570 7.400 . . src=W;mult=7; chr2R w2h ep 7007571 7007580 9.700 . . src=W;mult=9; chr2R w2h ep 7007581 7007590 10.200 . . src=W;mult=10;Again, at the end of the column, the multiplicity (mult) contains the average coverage in the given interval. augustus will consider each such line evidence that the sequence interval is part of an exon (ep=exonpart).
cat hints.est.gff hints.rnaseq.intron.gff hints.rnaseq.ep.gff > hints.gffhints.gff now contains the exon, intron and exonpart hints from ESTs as well as the intron and exonpart hints from RNA-Seq.
Start by copying another extrinsic configuration file:
cp $AUGUSTUS_CONFIG_PATH/extrinsic/extrinsic.M.RM.E.W.cfg extrinsic.bug.cfgNow edit extrinsic.bug.cfg so that the non-comment lines are like this. Alternatively, you may just copy that file from the result files and edit some of the bold numbers below.
[SOURCES] M E W [SOURCE-PARAMETERS] [GENERAL] start 1 1 M 1 1e+100 E 1 1 W 1 1 stop 1 1 M 1 1e+100 E 1 1 W 1 1 tss 1 1 M 1 1e+100 E 1 1 W 1 1 tts 1 1 M 1 1e+100 E 1 1 W 1 1 ass 1 1 M 1 1e+100 E 1 1 W 1 1 dss 1 1 M 1 1e+100 E 1 1 W 1 1 exonpart 1 .997 M 1 1e+100 E 1 1e2 W 1 1.05 exon 1 1 M 1 1e+100 E 1 1e4 W 1 1 intronpart 1 1 M 1 1e+100 E 1 1 W 1 1 intron 1 .3 M 1 1e+100 E 1 1e6 W 1 1 CDSpart 1 1 0.985 M 1 1e+100 E 1 1 W 1 1 CDS 1 1 M 1 1e+100 E 1 1 W 1 1 UTRpart 1 1 .96 M 1 1e+100 E 1 1 W 1 1 UTR 1 1 M 1 1e+100 E 1 1 W 1 1 irpart 1 1 M 1 1e+100 E 1 1 W 1 1 nonexonpart 1 1 M 1 1e+100 E 1 1 W 1 1 genicpart 1 1 M 1 1e+100 E 1 1 W 1 1The bold green numbers specify the bonus that a gene structure candidate gets for being compatible with a hint of that type and source.
For example, the 1e6 in the intron row after the source letter E means that for each intron hint from ESTs (src=E), gene structures that have an intron with both boundaries given as in the hint are rewarded by a factor of 1 million relatively to gene structures disregarding the intron hint.
The 1.05 in the exonpart row after the letter W specifies that for each exonpart hint in the RNA-Seq hints file (src=W), every gene structure that has an exon including the range of the hint gets a relative bonus factor 1.05 per multiplicity.
The red numbers mean a punishment (malus) for gene structures with unsupported features. For example, the .3 in the intron row means that every intron candidate that has no intron hints supporting it is punished by multiplying its relative probability with the factor 0.3. If you decrease this number even more (say from .3 to .001) then fewer introns unsupported by spliced transcriptome reads should be predicted. This would likely decrease the false positive intron rate, but also more true unsupported introns would be missed.
For more information look at into one of the extrinsic.cfg files.augustus --species=fly --predictionStart=7000001 --predictionEnd=7500000 chr2R.fa \ --extrinsicCfgFile=extrinsic.bug.cfg --hintsfile=hints.gff > augustus.hints.gff # takes ~9m
The species fly contains UTR parameters, which we didn't have the time to train for bug. When using RNA-Seq as hints it is better to use a model with UTRs, as a significant fraction of reads map to UTRs. It is also possible to use bug here, though.
The output augustus.hints.gff now looks like that# This output was generated with AUGUSTUS (version 2.5). ... # start gene g1 chr2R AUGUSTUS gene 7007533 7009385 0.2 - . g1 chr2R AUGUSTUS transcript 7007533 7009385 0.2 - . g1.t1 chr2R AUGUSTUS tts 7007533 7007533 . - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS exon 7007533 7008630 . - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS stop_codon 7007610 7007612 . - 0 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS intron 7008631 7008694 1 - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS intron 7008812 7008865 1 - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS intron 7009192 7009251 1 - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS CDS 7007610 7008630 1 - 1 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS CDS 7008695 7008811 1 - 1 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS exon 7008695 7008811 . - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS CDS 7008866 7009191 1 - 0 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS exon 7008866 7009191 . - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS CDS 7009252 7009353 0.94- 0 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS exon 7009252 7009385 . - . transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS start_codon 7009351 7009353 . - 0 transcript_id "g1.t1"; gene_id "g1"; chr2R AUGUSTUS tss 7009385 7009385 . - . transcript_id "g1.t1"; gene_id "g1"; # protein sequence = [MNTLSSARSVAIYVGPVRSSRSASVLAHEQAKSSITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKR # YGDIYVMPGMFGRKDWVTTFNTKDIEMVFRNEGIWPRRDGLDSIVYFREHVRPDVYGEVQGLVASQNEAWGKLRSAINPIFMQPRGLRMYYEPLSNIN # NEFIERIKEIRDPKTLEVPEDFTDEISRLVFESLGLVAFDRQMGLIRKNRDNSDALTLFQTSRDIFRLTFKLDIQPSMWKIISTPTYRKMKRTLNDSL # NVAQKMLKENQDALEKRRQAGEKINSNSMLERLMEIDPKVAVIMSLDILFAGVDATATLLSAVLLCLSKHPDKQAKLREELLSIMPTKDSLLNEENMK # DMPYLRAVIKETLRYYPNGLGTMRTCQNDVILSGYRVPKGTTVLLGSNVLMKEATYYPRPDEFLPERWLRDPETGKKMQVSPFTFLPFGFGPRMCIGK # RVVDLEMETTVAKLIRNFHVEFNRDASRPFKTMFVMEPAITFPFKFTDIEQ] # Evidence for and against this transcript: # % of transcript supported by hints (any source): 100 # CDS exons: 4/4 # E: 4 # W: 4 # CDS introns: 3/3 # E: 3 # 5'UTR exons and introns: 1/1 # E: 1 # 3'UTR exons and introns: 1/1 # W: 1 # hint groups fully obeyed: 137 # E: 4 (gi|15542574,SRR023546.8642467/1) # W: 133 # incompatible hint groups: 18 # E: 18 (gi|13769068,gi|4203815,gi|15543927,gi|38623822,gi|15539951,gi|14693753,gi|14699170,...) # end gene g1 ...
After each predicted transcript a little statistics follows about the support and compatibility of this transcript with the hints. Note, that AUGUSTUS now predicts alternative splice forms (ending e.g. in .t2).
Finally, make another custom track with the predictions using hints:echo -e "browser position chr2R:7299000-7318000\n\ track name=withhints description=\"Augustus predictions using hints\" db=dm3 visibility=3" > withhints.browser grep -P "AUGUSTUS\t(CDS|exon)" augustus.hints.gff >> withhints.browserIn the last line we are filering out from the predictions just the lines specifying exons and CDS. The additional exon lines identify the UTR (if you used fly) by subtracting the CDS ranges.
Again, you may also just click on this like to a prepared custom track with the preditions.