Multiple Sequence Alignments (MSA)
ClustalW format
The ClustalW format is a relatively simple text format containing a single multiple sequence alignment of DNA, RNA, or protein sequences. It was first used as an output format for the clustalw programs, but nowadays it may also be generated by various other sequence alignment tools. The specification is straightforward:
The first line starts with the words:
CLUSTAL W
or:
CLUSTALW
After the above header there is at least one empty line.
Finally, one or more blocks of sequence data follow, with consecutive blocks separated by at least one empty line.
Each line in a block of sequence data consists of the sequence name followed by the sequence symbols, separated by at least one whitespace character. Usually, the length of a sequence in one block does not exceed 60 symbols. Optionally, an additional whitespace-separated cumulative residue count may follow the sequence symbols. Optionally, a block may be followed by a line depicting the degree of conservation of the respective alignment columns.
Note
Sequence names and the sequences must not contain whitespace characters!
Allowed gap symbols are the hyphen (-) and the dot (.).
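The rules above can be sketched as a small reader. This is a minimal illustration, not the parser used by any particular tool: the function name parse_clustalw is hypothetical, and it assumes that conservation lines always begin with whitespace, which is what distinguishes them from sequence lines here.

```python
def parse_clustalw(text):
    """Return a dict mapping sequence name -> full aligned sequence."""
    lines = text.splitlines()
    # The first line must start with one of the two accepted header words.
    if not lines or not lines[0].startswith(("CLUSTAL W", "CLUSTALW")):
        raise ValueError("missing 'CLUSTAL W' / 'CLUSTALW' header")

    alignment = {}
    for line in lines[1:]:
        # Empty lines separate blocks. Lines that begin with whitespace are
        # taken to be conservation lines (an assumption of this sketch).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"malformed sequence line: {line!r}")
        name, symbols = fields[0], fields[1]
        # Any third field is the optional cumulative residue count; ignored.
        alignment[name] = alignment.get(name, "") + symbols
    return alignment
```

Because names and sequences must not contain whitespace, splitting each line on whitespace is enough to separate the name, the symbols, and the optional residue count.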
Warning
Please note that many programs that output this format tend to truncate the sequence names to a limited number of characters, for instance the first 15 characters. This can destroy the uniqueness of identifiers in your MSA.
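Whether truncation would make identifiers ambiguous can be checked before exporting an alignment. The following is a minimal sketch: the helper name find_truncation_collisions is hypothetical, and the default width of 15 characters is only the example value mentioned above, not a fixed property of the format.

```python
from collections import defaultdict


def find_truncation_collisions(names, width=15):
    """Group names that become identical when cut to `width` characters."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    # Keep only truncated prefixes that are shared by more than one name.
    return {prefix: full for prefix, full in groups.items() if len(full) > 1}


names = [
    "AL031296.1/85969-86120",
    "AL031296.1/85969-90000",  # hypothetical second region, for illustration
    "AANU01225121.1/438-603",
]
for prefix, clashing in find_truncation_collisions(names).items():
    print(prefix, "<-", clashing)
```

An empty result means the truncated names are still unique at the chosen width.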
Here is an example alignment in ClustalW format:
CLUSTAL W (1.83) multiple sequence alignment

AL031296.1/85969-86120 CUGCCUCACAACGUUUGUGCCUCAGUUACCCGUAGAUGUAGUGAGGGUAACAAUACUUAC
AANU01225121.1/438-603 CUGCCUCACAACAUUUGUGCCUCAGUUACUCAUAGAUGUAGUGAGGGUGACAAUACUUAC
AAWR02037329.1/29294-29150 ---CUCGACACCACU---GCCUCGGUUACCCAUCGGUGCAGUGCGGGUAGUAGUACCAAU

AL031296.1/85969-86120 UCUCGUUGGUGAUAAGGAACAGCU
AANU01225121.1/438-603 UCUCGUUGGUGAUAAGGAACAGCU
AAWR02037329.1/29294-29150 GCUAAUUAGUUGUGAGGACCAACU
Stockholm 1.0 format
Here is an example alignment in Stockholm 1.0 format:
# STOCKHOLM 1.0
#=GF AC RF01293
#=GF ID ACA59
#=GF DE Small nucleolar RNA ACA59
#=GF AU Wilkinson A
#=GF SE Predicted; WAR; Wilkinson A
#=GF SS Predicted; WAR; Wilkinson A
#=GF GA 43.00
#=GF TC 44.90
#=GF NC 40.30
#=GF TP Gene; snRNA; snoRNA; HACA-box;
#=GF BM cmbuild -F CM SEED
#=GF CB cmcalibrate --mpi CM
#=GF SM cmsearch --cpu 4 --verbose --nohmmonly -E 1000 -Z 549862.597050 CM SEQDB
#=GF DR snoRNABase; ACA59;
#=GF DR SO; 0001263; ncRNA_gene;
#=GF DR GO; 0006396; RNA processing;
#=GF DR GO; 0005730; nucleolus;
#=GF RN [1]
#=GF RM 15199136
#=GF RT Human box H/ACA pseudouridylation guide RNA machinery.
#=GF RA Kiss AM, Jady BE, Bertrand E, Kiss T
#=GF RL Mol Cell Biol. 2004;24:5797-5807.
#=GF WK Small_nucleolar_RNA
#=GF SQ 3
AL031296.1/85969-86120 CUGCCUCACAACGUUUGUGCCUCAGUUACCCGUAGAUGUAGUGAGGGUAACAAUACUUACUCUCGUUGGUGAUAAGGAACAGCU
AANU01225121.1/438-603 CUGCCUCACAACAUUUGUGCCUCAGUUACUCAUAGAUGUAGUGAGGGUGACAAUACUUACUCUCGUUGGUGAUAAGGAACAGCU
AAWR02037329.1/29294-29150 ---CUCGACACCACU---GCCUCGGUUACCCAUCGGUGCAGUGCGGGUAGUAGUACCAAUGCUAAUUAGUUGUGAGGACCAACU
#=GC SS_cons -----((((,<<<<<<<<<___________>>>>>>>>>,,,,<<<<<<<______>>>>>>>,,,,,))))::::::::::::
#=GC RF CUGCcccaCAaCacuuguGCCUCaGUUACcCauagguGuAGUGaGgGuggcAaUACccaCcCucgUUgGuggUaAGGAaCAgCU
//
See also…
WUSS notation for legal characters and their interpretation in the consensus secondary structure line SS_cons.
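The line types visible in the example above (the "# STOCKHOLM 1.0" header, "#=GF" per-file annotation, "#=GC" per-column annotation, plain sequence lines, and the "//" terminator) can be separated with a short reader. This is a rough sketch only: the function name parse_stockholm is hypothetical, other markup lines such as "#=GS" and "#=GR" are simply skipped, and only the first record of a file is read.

```python
def parse_stockholm(text):
    """Return (gf, gc, seqs) for the first record of a Stockholm 1.0 file.

    gf   : feature -> list of annotation strings (features may repeat)
    gc   : feature -> per-column annotation string
    seqs : sequence name -> aligned sequence
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("# STOCKHOLM 1.0"):
        raise ValueError("missing '# STOCKHOLM 1.0' header")

    gf, gc, seqs = {}, {}, {}
    for raw in lines[1:]:
        line = raw.rstrip()
        if not line:
            continue
        if line == "//":  # end of the record
            break
        if line.startswith("#=GF"):
            parts = line.split(None, 2)
            if len(parts) == 3:
                gf.setdefault(parts[1], []).append(parts[2])
        elif line.startswith("#=GC"):
            parts = line.split(None, 2)
            if len(parts) == 3:
                # Concatenate in case the alignment is split into blocks.
                gc[parts[1]] = gc.get(parts[1], "") + parts[2]
        elif line.startswith("#"):
            continue  # other markup and comments are ignored in this sketch
        else:
            fields = line.split()
            if len(fields) != 2:
                raise ValueError(f"malformed sequence line: {line!r}")
            name, symbols = fields
            seqs[name] = seqs.get(name, "") + symbols
    return gf, gc, seqs
```

For the example above, gc["SS_cons"] would hold the consensus structure string in WUSS notation, with the same number of columns as each aligned sequence.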
FASTA (Pearson) format
Note
Sequence names must not contain whitespace characters. Otherwise, the parts after the first whitespace will be dropped. The only allowed gap character is the hyphen (-).
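The name handling described in this note can be illustrated with a short reader. It is a sketch under stated assumptions: the function name parse_fasta_alignment is hypothetical, and the check that all sequences have equal length is an extra sanity test for aligned input rather than something the note itself requires.

```python
def parse_fasta_alignment(text):
    """Return a dict mapping sequence name -> aligned sequence."""
    alignment = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty sequence name")
            # Everything after the first whitespace is dropped.
            name = header[0]
            if name in alignment:
                raise ValueError(f"duplicate sequence name: {name}")
            alignment[name] = ""
        else:
            if name is None:
                raise ValueError("sequence data before the first '>' line")
            # Sequences may be wrapped over several lines.
            alignment[name] += line
    # Sanity check (assumption of this sketch): aligned rows share one length.
    if len({len(seq) for seq in alignment.values()}) > 1:
        raise ValueError("aligned sequences differ in length")
    return alignment
```

Note that dropping the text after the first whitespace can turn two distinct headers into the same name, which this sketch reports as a duplicate.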
Here is an example alignment in FASTA format:
>AL031296.1/85969-86120
CUGCCUCACAACGUUUGUGCCUCAGUUACCCGUAGAUGUAGUGAGGGUAACAAUACUUAC
UCUCGUUGGUGAUAAGGAACAGCU
>AANU01225121.1/438-603
CUGCCUCACAACAUUUGUGCCUCAGUUACUCAUAGAUGUAGUGAGGGUGACAAUACUUAC
UCUCGUUGGUGAUAAGGAACAGCU
>AAWR02037329.1/29294-29150
---CUCGACACCACU---GCCUCGGUUACCCAUCGGUGCAGUGCGGGUAGUAGUACCAAU
GCUAAUUAGUUGUGAGGACCAACU
MAF format
The multiple alignment format (MAF) is usually used to store multiple alignments at the DNA level between entire genomes. It consists of independent blocks of aligned sequences, each annotated with its genomic location. Consequently, a MAF-formatted MSA file may contain multiple records. MAF files start with a line:
##maf
which is optionally extended by whitespace-delimited key=value pairs. Lines starting with the character (#) are considered comments and are usually ignored.
A MAF block starts with the character (a) at the beginning of a line, optionally followed by whitespace-delimited key=value pairs. The next lines start with the character (s) and contain sequence information of the form:
s src start size strand srcSize sequence
where:
src is the name of the sequence source
start is the start of the aligned region within the source (0-based)
size is the length of the aligned region without gap characters
strand is either (+) or (-), depicting the location of the aligned region relative to the source
srcSize is the size of the entire sequence source, e.g. the full chromosome
sequence is the aligned sequence including gaps depicted by the hyphen (-)
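The block and field layout described above can be sketched as a small reader. The names parse_maf and MafSequence are hypothetical and exist only for this illustration. The sketch starts a new block at every "a" line and ignores any line type other than "a" and "s".

```python
from typing import NamedTuple


class MafSequence(NamedTuple):
    src: str        # name of the sequence source
    start: int      # 0-based start of the aligned region
    size: int       # length of the aligned region without gaps
    strand: str     # "+" or "-"
    src_size: int   # size of the entire source sequence
    sequence: str   # aligned sequence including "-" gaps


def parse_maf(text):
    """Return a list of blocks, each with 'attributes' and 'sequences'."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("##maf"):
        raise ValueError("MAF files must start with a '##maf' line")

    blocks = []
    for line in lines[1:]:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # empty lines and comments
        if fields[0] == "a":
            attributes = {}
            for pair in fields[1:]:
                key, _, value = pair.partition("=")
                attributes[key] = value
            blocks.append({"attributes": attributes, "sequences": []})
        elif fields[0] == "s":
            if not blocks:
                raise ValueError("'s' line outside of an alignment block")
            if len(fields) != 7:
                raise ValueError(f"malformed 's' line: {line!r}")
            _, src, start, size, strand, src_size, sequence = fields
            if strand not in ("+", "-"):
                raise ValueError(f"invalid strand: {strand!r}")
            blocks[-1]["sequences"].append(
                MafSequence(src, int(start), int(size), strand,
                            int(src_size), sequence)
            )
        # Other line types are ignored in this sketch.
    return blocks
```

Attribute values such as the block score are kept as strings, since the description above does not fix their types.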
Here is an example alignment in MAF format (taken from the UCSC Genome Browser website):
##maf version=1 scoring=tba.v8
# tba.v8 (((human chimp) baboon) (mouse rat))
# multiz.v7
# maf_project.v5 _tba_right.maf3 mouse _tba_C
# single_cov2.v4 single_cov2 /dev/stdin
a score=23262.0
s hg16.chr7 27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s baboon 116834 38 + 4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
s mm4.chr6 53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
s rn3.chr4 81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG
a score=5062.0
s hg16.chr7 27699739 6 + 158545518 TAAAGA
s panTro1.chr6 28862317 6 + 161576975 TAAAGA
s baboon 241163 6 + 4622798 TAAAGA
s mm4.chr6 53303881 6 + 151104725 TAAAGA
s rn3.chr4 81444246 6 + 187371129 taagga
a score=6636.0
s hg16.chr7 27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
s baboon 249182 13 + 4622798 gcagctgaaaaca
s mm4.chr6 53310102 13 + 151104725 ACAGCTGAAAATA
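Since size is defined as the length of the aligned region without gap characters, each "s" line can be cross-checked against its own sequence field. The helper below is a small sketch, and its name maf_size_is_consistent is hypothetical.

```python
def maf_size_is_consistent(s_line):
    """Check that the size field equals the number of non-gap symbols."""
    fields = s_line.split()
    if len(fields) != 7 or fields[0] != "s":
        raise ValueError("not a MAF 's' line")
    size = int(fields[3])
    sequence = fields[6]
    # Gaps are depicted by the hyphen and do not count towards size.
    return size == len(sequence) - sequence.count("-")


# First sequence line of the example alignment above.
line = "s hg16.chr7 27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG"
print(maf_size_is_consistent(line))
```

A mismatch usually points to a truncated or hand-edited line rather than to a different interpretation of the field.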