gbk2ENA is a command-line program written in Python that converts a standard GenBank file into an EMBL-like file suitable for submission to the European Nucleotide Archive (ENA). For more details about the sequence annotation format required for submitting to ENA, see ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt.
Clone this repository with the following command line:
git clone https://gitlab.pasteur.fr/GIPhy/gbk2ENA.git
Verify that Python (2.7 or higher) is installed, as well as Biopython (1.43 or higher).
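Both requirements can be checked with a short script such as the sketch below. The helper names `parse_version` and `meets_minimum` are illustrative and not part of gbk2ENA; the sketch assumes Biopython exposes its version as `Bio.__version__`.

```python
import re
import sys


def parse_version(text):
    """Turn a version string such as '1.79' or '1.81.dev0' into a tuple of ints."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component (e.g. 'dev0')
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(found, required):
    """Return True when version string `found` is at least `required`."""
    return parse_version(found) >= parse_version(required)


if __name__ == "__main__":
    python_version = "%d.%d" % sys.version_info[:2]
    status = "OK" if meets_minimum(python_version, "2.7") else "too old"
    print("Python %s: %s" % (python_version, status))
    try:
        import Bio
    except ImportError:
        print("Biopython: not installed")
    else:
        status = "OK" if meets_minimum(Bio.__version__, "1.43") else "too old"
        print("Biopython %s: %s" % (Bio.__version__, status))
```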
Execute the file gbk2ENA.py, available inside the src directory, using the following command line template:
python gbk2ENA.py [options]
Launch gbk2ENA with the option -h to read the following documentation:
usage: gbk2ENA [-h] -i FILEINPUT -o FILEOUTPUT -p PROJECTID [-a AUTHORS]
               [-t TITLE] [-s SEQTOPOLOGY] [-m MOLECULETYPE]
               [-c DATACLASS] [-d TAXODIV]

This tool converts Genbank files into EMBL-like files for submission to ENA

optional arguments:
  -h, --help       show this help message and exit
  -i FILEINPUT     (mandatory) input file in genbank format
  -o FILEOUTPUT    (mandatory) output file name
  -p PROJECTID     (mandatory) project id (PR lines)
  -a AUTHORS       reference authors (RA lines); default: "Unknown"
  -t TITLE         reference title (RT lines); default: "N/A"
  -s SEQTOPOLOGY   sequence topology (ID token 3); default: "linear"
  -m MOLECULETYPE  molecule type (ID token 4); default: "genomic DNA"
  -c DATACLASS     data class (ID token 5); default: "STD"
                   (see 3.1 at ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt)
  -d TAXODIV       taxonomic division (ID token 6); default: "UNC"
                   (see 3.2 at ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt)
Below are some useful excerpts from ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt.
Section 3.1, concerning the option -c
The data class of each entry, representing a methodological approach to the generation of the data or a type of data, is indicated on the first (ID) line of the entry. Each entry belongs to exactly one data class.

Class    Definition
-----    -----------------------------------------------------------
CON      Entry constructed from segment entry sequences; if unannotated,
         annotation may be drawn from segment entries
PAT      Patent
EST      Expressed Sequence Tag
GSS      Genome Survey Sequence
HTC      High Throughput cDNA sequencing
HTG      High Throughput Genome sequencing
MGA      Mass Genome Annotation
WGS      Whole Genome Shotgun
TSA      Transcriptome Shotgun Assembly
STS      Sequence Tagged Site
STD      Standard (all entries not classified as above)
Section 3.2, concerning the option -d
The entries which constitute the database are grouped into taxonomic divisions, the object being to create subsets of the database which reflect areas of interest for many users. In addition to the division, each entry contains a full taxonomic classification of the organism that was the source of the stored sequence, from kingdom down to genus and species (see below). Each entry belongs to exactly one taxonomic division. The ID line of each entry indicates its taxonomic division, using the three letter codes shown below:

Division                Code
--------------------    ----
Bacteriophage           PHG
Environmental Sample    ENV
Fungal                  FUN
Human                   HUM
Invertebrate            INV
Other Mammal            MAM
Other Vertebrate        VRT
Mus musculus            MUS
Plant                   PLN
Prokaryote              PRO
Other Rodent            ROD
Synthetic               SYN
Transgenic              TGN
Unclassified            UNC
Viral                   VRL
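The codes listed in sections 3.1 and 3.2 can be checked before passing them to -c and -d. The sketch below is an optional pre-check, not a description of gbk2ENA's internals (whether the tool validates these values itself is not stated here); the names `DATA_CLASSES`, `TAXONOMIC_DIVISIONS` and `check_code` are illustrative.

```python
# Codes copied from the section 3.1 and 3.2 excerpts of the user manual.
DATA_CLASSES = {
    "CON", "PAT", "EST", "GSS", "HTC", "HTG", "MGA", "WGS", "TSA", "STS", "STD",
}
TAXONOMIC_DIVISIONS = {
    "PHG", "ENV", "FUN", "HUM", "INV", "MAM", "VRT", "MUS",
    "PLN", "PRO", "ROD", "SYN", "TGN", "UNC", "VRL",
}


def check_code(value, allowed, label):
    """Return the normalised three-letter code, or raise ValueError if unknown."""
    code = value.strip().upper()
    if code not in allowed:
        raise ValueError(
            "unknown %s %r; expected one of: %s"
            % (label, value, ", ".join(sorted(allowed)))
        )
    return code


if __name__ == "__main__":
    print(check_code("std", DATA_CLASSES, "data class"))
    print(check_code("PRO", TAXONOMIC_DIVISIONS, "taxonomic division"))
```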
Section 3.4.1, concerning the option -s
Sequence topology: 'circular' or 'linear'
Section 3.4.1 Note 1, concerning the option -m
Molecule type: this represents the type of molecule as stored and can be any value from the list of current values for the mandatory mol_type source qualifier. This item should be the same as the value in the mol_type qualifier(s) in a given entry.
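The "ID token" numbers in the option descriptions refer to positions on the ID line of the EMBL flat file. The sketch below shows how such a line is laid out, assuming the seven-token layout of section 3.4.1 of the user manual (accession; sequence version; topology; molecule type; data class; taxonomic division; sequence length). The function `compose_id_line` is hypothetical, and the "XXX" placeholders for the accession and version tokens are an assumption; the exact line written by gbk2ENA may differ.

```python
def compose_id_line(length, topology="linear", molecule_type="genomic DNA",
                    data_class="STD", division="UNC",
                    accession="XXX", version="XXX"):
    """Build an EMBL-style ID line; defaults mirror the gbk2ENA option defaults."""
    if topology not in ("linear", "circular"):
        raise ValueError("topology must be 'linear' or 'circular'")
    tokens = [
        accession,           # token 1 (placeholder, assumption)
        version,             # token 2 (placeholder, assumption)
        topology,            # token 3, option -s
        molecule_type,       # token 4, option -m
        data_class,          # token 5, option -c
        division,            # token 6, option -d
        "%d BP." % length,   # token 7, sequence length
    ]
    return "ID   " + "; ".join(tokens)


if __name__ == "__main__":
    print(compose_id_line(1234, division="PRO"))
    # ID   XXX; XXX; linear; genomic DNA; STD; PRO; 1234 BP.
```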
The GenBank file F.columnare.PH-97028.gbk inside the directory example contains the annotated draft assembly of a Flavobacterium columnare strain (Criscuolo et al. 2018) created by the annotation program Prokka (Seemann 2014). The following command line creates the file F.columnare.PH-97028.embl, suitable for submission to ENA under the project id PRJEB25044:
python gbk2ENA.py -i F.columnare.PH-97028.gbk -p PRJEB25044 -t "Draft genome of Flavobacterium columnare strain PH-97028 (= CIP 109753)" -d PRO -o F.columnare.PH-97028.embl
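When several GenBank files have to be converted, the same command line can be assembled from a script. The sketch below only uses the options documented above; the helper names `build_command` and `run_conversion` are illustrative, and it assumes that gbk2ENA.py is reachable at the path given in `script`.

```python
import subprocess
import sys


def build_command(gbk_path, embl_path, project_id, title=None, authors=None,
                  division=None, script="gbk2ENA.py"):
    """Return the gbk2ENA argument list for one input file."""
    command = [sys.executable, script,
               "-i", gbk_path, "-o", embl_path, "-p", project_id]
    # Optional flags are appended only when a value is supplied,
    # so that gbk2ENA falls back to its documented defaults otherwise.
    for flag, value in (("-a", authors), ("-t", title), ("-d", division)):
        if value is not None:
            command.extend([flag, value])
    return command


def run_conversion(command):
    """Run gbk2ENA and return its exit status (0 is conventionally success)."""
    return subprocess.call(command)


if __name__ == "__main__":
    example = build_command(
        "F.columnare.PH-97028.gbk",
        "F.columnare.PH-97028.embl",
        "PRJEB25044",
        title="Draft genome of Flavobacterium columnare strain PH-97028 (= CIP 109753)",
        division="PRO",
    )
    print(" ".join(example))
```

Passing the arguments as a list avoids shell quoting issues with titles that contain spaces or parentheses.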
Criscuolo A, Chesneau O, Clermont D, Bizet C (2018) Draft genome sequence of the fish pathogen Flavobacterium columnare genomovar III strain PH-97028 (=CIP 109753). Genome Announcements, 6(14):e00222-18. doi:10.1128/genomeA.00222-18
Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14):2068-2069. doi:10.1093/bioinformatics/btu153