bppseqman (BppSuite Manual 2.2.0)

4.10 BppSeqMan: Bio++ Sequence Manipulation

The Bio++ Sequence Manipulator converts between various file formats and can also perform various operations on sequences. It uses the common options for setting the alphabet, loading the sequences (see Sequences) and writing the resulting data set (see WritingSequences). The “Generic” alphabet option is sufficient if only file format conversion is to be performed, but the correct alphabet must be specified for more advanced manipulations, such as in silico molecular biology.

BppSeqMan can perform any number of elementary operations, in any order, provided that the output of operation n is compatible with the input of operation n+1, and that the input of operation 1 is compatible with the input data.
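
The compatibility rule can be illustrated with a short Python sketch. This is not BppSeqMan code: the `OPERATIONS` table and the `check_chain` function are hypothetical names, and the table only encodes the alphabet constraints stated in this section for four of the operations.

```python
# Hypothetical model of the chaining rule: each operation maps the alphabet
# it accepts to the alphabet it produces (constraints taken from this section).
OPERATIONS = {
    "Complement": {"DNA": "DNA", "RNA": "RNA"},
    "Transcript": {"DNA": "RNA", "RNA": "DNA"},
    "Switch": {"DNA": "RNA", "RNA": "DNA"},
    "Translate": {"DNA": "Protein", "RNA": "Protein"},
}


def check_chain(input_alphabet, operations):
    """Return the final alphabet, or raise ValueError at the first bad step."""
    current = input_alphabet
    for step, name in enumerate(operations, start=1):
        accepted = OPERATIONS[name]
        if current not in accepted:
            raise ValueError(
                f"operation {step} ({name}) cannot take {current} input"
            )
        # The output of this step becomes the input of the next one.
        current = accepted[current]
    return current


print(check_chain("DNA", ["Transcript", "Translate"]))  # Protein
```

In this model, a chain such as `Translate,Complement` is rejected at step 2, because the protein output of `Translate` is not a valid input for `Complement`.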

Specific options:

sequence.manip = {list<string>}: The list, in appropriate order, of elementary operations to perform. See below for a list of these operations.

Complement [[alphabet = DNA or RNA]]: Convert to the complementary sequence, keeping the original alphabet.
Transcript [[alphabet = DNA or RNA]]: Convert to the complementary sequence, switching the type of alphabet (DNA<->RNA).
Switch [[alphabet = DNA or RNA]]: Change the alphabet type (DNA<->RNA).
Translate [[alphabet = DNA or RNA]]: Convert to proteins. The genetic code used for translation is set via the genetic_code option. Genetic code is set once for all sequences.
Invert: Invert the sequence 5’ <-> 3’ or N <-> C.
RemoveGaps: Remove all gaps in sequences (i.e., ‘unalign’).
GapToUnknown: Change gaps to fully unresolved characters, N for nucleotides and X for proteins.
UnknownToGap: Change (partially) unresolved characters to gaps.
RemoveStops: Remove all stop codons in sequences. If sequences are aligned, stop codons will be replaced by gaps. The genetic code used for translation is set via the genetic_code option. Genetic code is set once for all sequences.
RemoveColumnsWithStops: Remove all sites with at least one stop codon. The genetic code used for translation is set via the genetic_code option. Genetic code is set once for all sequences.
GetCDS: Remove the first stop codon and everything after it in codon sequences.
CoerceToAlignment: Try to convert a set of sequences to an alignment. This will fail if sequences do not have the same length. This step is required before trying commands ‘ResolveDotted’ or ‘KeepComplete’.
ResolveDotted(alphabet={RNA|DNA|Proteins}) [[Aligned sequences]]: Convert a human-readable alignment to a machine-readable alignment. This manipulation must be first if it is used, and the data must be loaded with the Generic alphabet. alphabet: The alphabet to use in order to resolve a dotted alignment.
KeepComplete(maxGapAllowed={int>0} or {float[0,100]}+%) [[Aligned sequences]]: Keep only complete sites, i.e., sites without any gap. Sites with unresolved characters are not removed. It is also possible to set a maximum number or proportion of gaps, see specific options. maxGapAllowed: The maximum number (integer) or proportion (percentage) of gaps allowed.
GetCodonPosition(position={1|2|3}): Retrieve the given positions from codon sequences (aligned or not).
FilterFromTree(tree.file={path}, tree.format={chars}): Get a subset of sequences based on a tree file. The order of sequences in the file will reflect the tree structure. All sequences which do not have a corresponding leaf in the tree, based on the sequence name, will be removed. This method can therefore be used for subsetting a list of sequences, and/or rearranging them in a more convenient manner.
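
The difference between Complement, Transcript, Switch and Invert is easiest to see on a small sequence. The Python sketch below follows the descriptions in the list above. The function names are hypothetical, and it assumes upper-case, fully resolved nucleotides; it does not reproduce how BppSeqMan handles gaps or ambiguity codes.

```python
# Illustration only: each function returns (sequence, alphabet).
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def complement(seq, alphabet):
    """Complementary sequence, same alphabet."""
    table = DNA_COMPLEMENT if alphabet == "DNA" else RNA_COMPLEMENT
    return seq.translate(table), alphabet


def switch(seq, alphabet):
    """Same sequence, other alphabet (T <-> U)."""
    if alphabet == "DNA":
        return seq.replace("T", "U"), "RNA"
    return seq.replace("U", "T"), "DNA"


def transcript(seq, alphabet):
    """Complementary sequence, other alphabet."""
    return switch(*complement(seq, alphabet))


def invert(seq, alphabet):
    """Reverse the reading direction, same alphabet."""
    return seq[::-1], alphabet


print(complement("AACGT", "DNA"))  # ('TTGCA', 'DNA')
print(switch("AACGT", "DNA"))      # ('AACGU', 'RNA')
print(transcript("AACGT", "DNA"))  # ('UUGCA', 'RNA')
print(invert("AACGT", "DNA"))      # ('TGCAA', 'DNA')
```

Under these assumptions, a reverse complement corresponds to chaining the two operations, for example `sequence.manip=Complement,Invert`.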

Examples of use:

Just change file format:
```
sequence.manip=
```
Change DNA to RNA:
```
sequence.manip=Switch
```
Unalign sequences, perform transcription and translate to proteins:
```
sequence.manip=RemoveGaps,Transcript,Translate
```
Change all unresolved characters to gaps and keep only positions with fewer than 5 gaps:
```
sequence.manip=UnknownToGap,KeepComplete(maxGapAllowed=5)
```
Keep only positions with less than 30% of gaps, and change them to unresolved characters:
```
sequence.manip=KeepComplete(maxGapAllowed=30%),GapToUnknown
```
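The two forms of maxGapAllowed (an absolute count such as 5, or a percentage such as 30%) can be compared with a minimal Python sketch. The names `parse_max_gap` and `keep_complete` are hypothetical, and the sketch assumes that a site is kept when its number of gaps does not exceed the limit; the exact treatment of the boundary case in BppSeqMan is not specified here.

```python
def parse_max_gap(value, n_sequences):
    """Turn '5' or '30%' into a maximum number of gaps per site."""
    if value.endswith("%"):
        # Percentage of the number of sequences in the alignment.
        return float(value[:-1]) / 100.0 * n_sequences
    return float(int(value))


def keep_complete(alignment, max_gap_allowed, gap="-"):
    """Keep the sites (columns) whose gap count is within the limit."""
    n = len(alignment)
    limit = parse_max_gap(max_gap_allowed, n)
    columns = zip(*alignment)
    # Assumption: a site is kept when gaps <= limit.
    kept = [col for col in columns if col.count(gap) <= limit]
    if not kept:
        return [""] * n
    return ["".join(row) for row in zip(*kept)]


alignment = ["AC-T", "A-GT", "ACGT", "ACGT"]
print(keep_complete(alignment, "0"))    # ['AT', 'AT', 'AT', 'AT']
print(keep_complete(alignment, "25%"))  # ['AC-T', 'A-GT', 'ACGT', 'ACGT']
```

With four sequences, 25% corresponds to one gap per site, so both sites containing a single gap are retained; with a limit of 0 only the gap-free sites remain.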