phold is a sensitive annotation tool for bacteriophage
genomes and metagenomes using protein structural homology.
phold uses the ProstT5 protein
language model to rapidly translate protein amino acid sequences to the
3Di token alphabet used by Foldseek. Foldseek
is then used to search these against a database of over 1 million phage
protein structures mostly predicted using Colabfold.
Alternatively, if you have pre-computed protein structures for your
phage(s), you can supply them to phold compare instead of using ProstT5.
The phold database consists of over 1 million protein
structures generated using Colabfold from the
following databases:
- PHROGs - 440549
de-duplicated proteins. Proteins over 1000AA were chunked into 1000AA
components.
- ENVHOGs - 315k
representative proteins from the 2.2M ENVHOGs with a PHROG function
other than ‘unknown function’. Proteins over 1000AA were chunked into
1000AA components.
- EFAM -
262k efam “extra conservative” proteins. Proteins over 1000AA were
chunked into 1000AA components.
- DGRs -
12683 extra diversity-generating element proteins from Roux et al.
2021.
- VFDB - over 28k
structures of putative bacterial virulence factors from the VFDB.
- CARD - nearly 5k structures
of antibiotic resistance proteins from CARD.
- acrDB - nearly 3.7k
anti-CRISPR proteins predicted in this study.
- Defensefinder - 455
monomer prokaryotic defense proteins.
- Netflax - 7153
toxin-antitoxin proteins.
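
The chunking applied to the PHROGs, ENVHOGs and EFAM proteins above can be illustrated with a minimal Python sketch. This is not phold's own code: the function name `chunk_protein` is hypothetical, and it assumes consecutive, non-overlapping windows with a shorter final piece, which the description above does not specify.

```python
def chunk_protein(sequence: str, max_len: int = 1000) -> list[str]:
    """Split a protein sequence into consecutive pieces of at most max_len residues.

    Assumption: windows do not overlap and the last piece may be shorter.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [sequence[i:i + max_len] for i in range(0, len(sequence), max_len)]


if __name__ == "__main__":
    # Hypothetical 2500 AA protein: one methionine followed by alanines.
    protein = "M" + "A" * 2499
    chunks = chunk_protein(protein)
    print([len(c) for c in chunks])  # [1000, 1000, 500]
```

Under these assumptions, a protein of 1000AA or fewer comes back as a single piece, and joining the pieces in order recovers the original sequence.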
Google Colab Notebook
If you don’t want to install phold
locally, you can run
it without any code using this
Google Colab notebook.