gbk2ENA

gbk2ENA is a command-line program written in Python that converts a standard GenBank file into an EMBL-like file suitable for submission to the European Nucleotide Archive (ENA). For details on the sequence annotation format required for ENA submissions, see ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt.

Installation and execution

Clone this repository with the following command line:

git clone https://gitlab.pasteur.fr/GIPhy/gbk2ENA.git

Verify that Python (2.7 or higher) is installed, as well as Biopython (1.43 or higher).
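
As a quick way to confirm these requirements, the following sketch prints the interpreter and Biopython versions and compares them with the minimums stated above. The helper names (version_tuple, check_requirements) are illustrative and not part of gbk2ENA. The sketch assumes the installed Biopython exposes Bio.__version__, and reports the version as unknown otherwise.

```python
import re
import sys

MIN_PYTHON = (2, 7)      # minimum stated in this README
MIN_BIOPYTHON = (1, 43)  # minimum stated in this README


def version_tuple(text):
    """Return the first two numeric components of a version string as ints."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:2])


def check_requirements():
    """Print the detected versions; return True if both minimums are met."""
    python_ok = tuple(sys.version_info[:2]) >= MIN_PYTHON
    print("Python %d.%d (ok: %s)"
          % (sys.version_info[0], sys.version_info[1], python_ok))
    try:
        import Bio  # Biopython's top-level package
    except ImportError:
        print("Biopython is not installed")
        return False
    # Assumption: the installed release defines Bio.__version__.
    found = getattr(Bio, "__version__", None)
    if found is None:
        print("Biopython is installed, but its version is unknown")
        return False
    biopython_ok = version_tuple(found) >= MIN_BIOPYTHON
    print("Biopython %s (ok: %s)" % (found, biopython_ok))
    return python_ok and biopython_ok


if __name__ == "__main__":
    check_requirements()
```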

Run the file gbk2ENA.py, located in the src directory, using the following command-line template:

python gbk2ENA.py [options]

Usage

Launch gbk2ENA with the option -h to display the following documentation:

usage: gbk2ENA [-h] -i FILEINPUT -o FILEOUTPUT -p PROJECTID [-a AUTHORS]
               [-t TITLE] [-s SEQTOPOLOGY] [-m MOLECULETYPE]
               [-c DATACLASS] [-d TAXODIV]

This tool converts Genbank files into EMBL-like files for submission to ENA

optional arguments:
  -h, --help       show this help message and exit
  -i FILEINPUT     (mandatory) input file in genbank format
  -o FILEOUTPUT    (mandatory) output file name
  -p PROJECTID     (mandatory) project id (PR lines)
  -a AUTHORS       reference authors (RA lines);     default: "Unknown"
  -t TITLE         reference title (RT lines);       default: "N/A"
  -s SEQTOPOLOGY   sequence topology (ID token 3);   default: "linear"
  -m MOLECULETYPE  molecule type (ID token 4);       default: "genomic DNA"
  -c DATACLASS     data class (ID token 5);          default: "STD"
                   (see 3.1 at ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt)
  -d TAXODIV       taxonomic division (ID token 6);  default: "UNC"
                   (see 3.2 at ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt)

Below are some useful excerpts from ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt.

Section 3.1, concerning the option -c

 The data class of each entry, representing a methodological approach to the 
 generation of the data or a type of data, is indicated on the first (ID) line 
 of the entry. Each entry belongs to exactly one data class.

  Class          Definition
  -----------    -----------------------------------------------------------
  CON            Entry constructed from segment entry sequences; if unannotated,
                 annotation may be drawn from segment entries
  PAT            Patent
  EST            Expressed Sequence Tag
  GSS            Genome Survey Sequence
  HTC            High Throughput CDNA sequencing
  HTG            High Throughput Genome sequencing
  MGA            Mass Genome Annotation
  WGS            Whole Genome Shotgun
  TSA            Transcriptome Shotgun Assembly
  STS            Sequence Tagged Site
  STD            Standard (all entries not classified as above)

Section 3.2, concerning the option -d

The entries which constitute the database are grouped into taxonomic divisions,
the object being to create subsets of the database which reflect areas of
interest for many users.
In addition to the division, each entry contains a full taxonomic
classification of the organism that was the source of the stored sequence,
from kingdom down to genus and species (see below).
Each entry belongs to exactly one taxonomic division. The ID line of each entry
indicates its taxonomic division, using the three letter codes shown below:

                     Division                 Code
                     -----------------        ----
                     Bacteriophage            PHG
                     Environmental Sample     ENV
                     Fungal                   FUN
                     Human                    HUM
                     Invertebrate             INV
                     Other Mammal             MAM
                     Other Vertebrate         VRT
                     Mus musculus             MUS 
                     Plant                    PLN
                     Prokaryote               PRO
                     Other Rodent             ROD
                     Synthetic                SYN
                     Transgenic               TGN
                     Unclassified             UNC
                     Viral                    VRL

Section 3.4.1, concerning the option -s

Sequence topology: 'circular' or 'linear' 

Section 3.4.1 Note 1, concerning the option -m

Molecule type: this represents the type of molecule as stored and can be
any value from the list of current values for the mandatory mol_type source 
qualifier. This item should be the same as the value in the mol_type 
qualifier(s) in a given entry.
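
To make the role of the options -s, -m, -c and -d more concrete, here is a minimal sketch of how tokens 3 to 6 fit into an ID line, following the layout described in section 3.4.1 of usrman.txt (accession; sequence version; topology; molecule type; data class; taxonomic division; length). The function build_id_line and its parameter names are hypothetical and do not reproduce the actual gbk2ENA code; the defaults mirror those listed in the usage message, and the allowed values are copied from the excerpts above.

```python
TOPOLOGIES = ("linear", "circular")
DATA_CLASSES = ("CON", "PAT", "EST", "GSS", "HTC", "HTG",
                "MGA", "WGS", "TSA", "STS", "STD")
DIVISIONS = ("PHG", "ENV", "FUN", "HUM", "INV", "MAM", "VRT", "MUS",
             "PLN", "PRO", "ROD", "SYN", "TGN", "UNC", "VRL")


def build_id_line(accession, version, length, topology="linear",
                  molecule_type="genomic DNA", data_class="STD",
                  division="UNC"):
    """Assemble an EMBL-style ID line from its seven tokens (illustrative)."""
    if topology not in TOPOLOGIES:
        raise ValueError("unknown topology: %r" % topology)
    if data_class not in DATA_CLASSES:
        raise ValueError("unknown data class: %r" % data_class)
    if division not in DIVISIONS:
        raise ValueError("unknown taxonomic division: %r" % division)
    # The molecule type is not validated here: it should match the
    # mol_type source qualifier of the entry.
    return "ID   %s; SV %d; %s; %s; %s; %s; %d BP." % (
        accession, version, topology, molecule_type,
        data_class, division, length)


# Reproduces the layout of a typical ID line:
print(build_id_line("X56734", 1, 1859, molecule_type="mRNA", division="PLN"))
# ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
```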

Example

The GenBank file F.columnare.PH-97028.gbk in the example directory contains the annotated draft assembly of a Flavobacterium columnare strain (Criscuolo et al. 2018), as produced by the annotation program Prokka (Seemann 2014). The following command line creates the file F.columnare.PH-97028.embl, suitable for submission to the ENA under the project id PRJEB25044:

python  gbk2ENA.py  -i F.columnare.PH-97028.gbk  -p PRJEB25044  -t "Draft genome of Flavobacterium columnare strain PH-97028 (= CIP 109753)"  -d PRO  -o F.columnare.PH-97028.embl 
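
When several GenBank files have to be converted, the same command can be assembled and launched from a Python script. The sketch below is one possible way to do so, using only the options documented in the usage message; build_command and run_conversion are hypothetical helper names, and the default script path assumes the script is started from the src directory.

```python
import subprocess
import sys


def build_command(input_gbk, project_id, output_embl, title=None,
                  division=None, script="gbk2ENA.py"):
    """Assemble the argument list for one gbk2ENA run."""
    # sys.executable is the interpreter currently running this script.
    cmd = [sys.executable, script, "-i", input_gbk, "-p", project_id]
    if title is not None:
        cmd += ["-t", title]
    if division is not None:
        cmd += ["-d", division]
    cmd += ["-o", output_embl]
    return cmd


def run_conversion(*args, **kwargs):
    """Run gbk2ENA with the given settings and return its exit status."""
    return subprocess.call(build_command(*args, **kwargs))


# Equivalent of the command line above (not executed on import):
# run_conversion("F.columnare.PH-97028.gbk", "PRJEB25044",
#                "F.columnare.PH-97028.embl", division="PRO",
#                title="Draft genome of Flavobacterium columnare "
#                      "strain PH-97028 (= CIP 109753)")
```

Passing the arguments as a list avoids shell quoting issues with titles that contain spaces or parentheses.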

References

Criscuolo A, Chesneau O, Clermont D, Bizet C (2018) Draft genome sequence of the fish pathogen Flavobacterium columnare genomovar III strain PH-97028 (=CIP 109753). Genome Announcements, 6(14):e00222-18. doi:10.1128/genomeA.00222-18

Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14):2068-2069. doi:10.1093/bioinformatics/btu153