What Is FASTA Format?

FASTA is a textual format for nucleotide or polypeptide sequences in which nucleotides or amino acids are designated using one-letter codes. Due to its simplicity and practicality, it is currently used by most biological sequencing programs. Files of this format can contain sequence names, their identifiers in databases and comments. Depending on the nature of the biological sequences it contains, the FASTA file can have various extensions.

History

The format was invented by David Lipman and William Pearson in 1985 for the program of the same name, which searches large sequence databases for sequences homologous to a given one. They first described the format in that program's documentation; a description of it is now also part of the documentation of the BLAST program.

The simplicity of the FASTA format allows you to easily perform various actions with sequences using text editing tools and scripting programming languages such as Python, Ruby, Perl, Java.

The FASTA and FASTQ (Sanger Institute) formats are the most popular formats for representing biological sequence data. There are also other formats, including those used by GenBank, EMBL and UniProt.

Format

A FASTA record begins with a one-line description followed by lines containing the sequence itself. The description line is marked by a greater-than symbol (“>”) in the first column. The word following this character, up to the first space, is the sequence identifier; the rest of the line is an optional description. The lines immediately after the description may start with a semicolon (“;”), in which case they are treated as comments. Many databases and programs no longer recognize comments, so they are rarely used.

The lines that follow contain the actual biological sequence. Sequence lines are typically limited to 80 to 120 characters for historical reasons, but modern programs also accept a sequence written entirely on a single line. Several sequences can be written to one file, which is then called a multi-FASTA file; each sequence must have its own identifier.

An example of one sequence in FASTA format:

>gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKI
IRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGFIRENE

The identifier for this sequence is gi|31563518|ref|NP_852610.1|.
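The record layout described above can be read with a few lines of Python. This is a minimal sketch, not a reference implementation: the names parse_fasta and _make_record are made up for illustration, and it assumes that lines starting with a semicolon can simply be discarded and that the identifier is the first whitespace-delimited word of the description line.

```python
from typing import Iterable, Iterator, List, Tuple


def _make_record(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # The identifier is the first whitespace-delimited word of the header;
    # whatever follows it is the optional free-text description.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1].strip() if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each record."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # assumption: blank lines and ';' comments are dropped
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without separators
    if header is not None:
        yield _make_record(header, chunks)


if __name__ == "__main__":
    sample = [">seq1 first record", "; a comment", "ACGT", "acgt", ">seq2", "GG"]
    for identifier, description, sequence in parse_fasta(sample):
        print(identifier, description, sequence)
```

Because the function is a generator, it can be fed an open file object directly, so a large multi-FASTA file does not have to be loaded into memory at once.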

Sequences are written using one-letter nucleotide or amino acid codes that match the standard IUB/IUPAC designations, in order from the 5′ end to the 3′ end for nucleic acids and from the N-terminus to the C-terminus for proteins. Spaces are allowed within a sequence, and letters may be upper or lower case. Sequence programs ignore digits, line terminators and tabs.

Nucleic acids are indicated as follows:

Code	Meaning	Mnemonic
A	A	Adenine
C	C	Cytosine
G	G	Guanine
T	T	Thymine (5-methyluracil)
U	U	Uracil
R	A, G	Purines
Y	C, T, U	Pyrimidines
K	G, T, U	Ketone bases
M	A, C	Bases with amino groups (aMino)
S	C, G	Strong interaction in a complementary pair (three hydrogen bonds)
W	A, T, U	Weak interaction in a complementary pair (two hydrogen bonds)
B	not A (C, G, T or U)	B comes after A
D	not C (A, G, T or U)	D comes after C
H	not G (A, C, T or U)	H comes after G
V	not T or U (A, C or G)	V comes after U
N	A, C, G, T, U	Any (aNy) nucleotide
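The nucleotide codes in the table can be expressed as a lookup from each code to the bases it stands for. The sketch below is illustrative only: the names IUPAC_DNA, bases_for and matches are hypothetical, and it assumes for simplicity that U is treated as equivalent to T.

```python
# Each code mapped to the bases it may represent, following the table above.
# Assumption: U is folded into T so that DNA and RNA compare equal.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def bases_for(code: str) -> set:
    """Return the set of concrete bases a one-letter code can stand for."""
    return set(IUPAC_DNA[code.upper()])  # raises KeyError for unknown codes


def matches(pattern: str, sequence: str) -> bool:
    """True if each position of `sequence` is permitted by `pattern`."""
    if len(pattern) != len(sequence):
        return False
    return all(bases_for(s) <= bases_for(p) for p, s in zip(pattern, sequence))


if __name__ == "__main__":
    print(sorted(bases_for("R")))
    print(matches("RYN", "ACG"))
```

A position matches when every base the sequence code could mean is also allowed by the pattern code, so an ambiguous sequence position only matches an equally or more permissive pattern position.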

For amino acids, there are 22 common codes (the 20 canonical amino acids plus selenocysteine and pyrrolysine), 4 special codes (designations that cover several amino acids) and * for a stop codon (in formal gene translations).

Amino acid code	Meaning
A	Alanine
B	Aspartic acid (D) or Asparagine (N)
C	Cysteine
D	Aspartic acid
E	Glutamic acid
F	Phenylalanine
G	Glycine
H	Histidine
I	Isoleucine
J	Leucine (L) or Isoleucine (I)
K	Lysine
L	Leucine
M	Methionine
N	Asparagine
O	Pyrrolysine
P	Proline
Q	Glutamine
R	Arginine
S	Serine
T	Threonine
U	Selenocysteine
V	Valine
W	Tryptophan
Y	Tyrosine
Z	Glutamic acid (E) or Glutamine (Q)
X	Any amino acid
*	Translation termination (stop)
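As a rough illustration of how the amino acid codes fall into groups, the following sketch counts the residues of a protein string by category. The function name classify_residues and the category labels are invented for this example; the grouping itself follows the table above.

```python
# The 20 canonical amino acids plus selenocysteine (U) and pyrrolysine (O).
STANDARD = set("ACDEFGHIKLMNPQRSTVWY") | {"U", "O"}
# Codes that stand for one of two possible amino acids.
AMBIGUOUS = {"B": "DN", "J": "LI", "Z": "EQ"}


def classify_residues(sequence: str) -> dict:
    """Count residues by category; letters are compared case-insensitively."""
    counts = {"standard": 0, "ambiguous": 0, "unknown": 0, "stop": 0, "invalid": 0}
    for ch in sequence.upper():
        if ch in STANDARD:
            counts["standard"] += 1
        elif ch in AMBIGUOUS:
            counts["ambiguous"] += 1
        elif ch == "X":
            counts["unknown"] += 1  # X means any amino acid
        elif ch == "*":
            counts["stop"] += 1
        else:
            counts["invalid"] += 1  # digits, punctuation, and so on
    return counts


if __name__ == "__main__":
    print(classify_residues("MKBZX*"))
```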

The FASTA format is also used for files containing biological sequence alignments. In this case, gap characters (usually a hyphen or a period) are inserted into each sequence at positions that are absent from that sequence. As a result, all sequences in the file must have the same length.
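The equal-length requirement for alignment files is easy to check in code. The helpers below are a small sketch with hypothetical names (is_aligned, ungap); they assume that only the hyphen and the period are used as gap characters.

```python
GAP_CHARS = "-."  # assumption: only these two characters denote gaps


def is_aligned(sequences) -> bool:
    """True if all sequences have the same length (or the list is empty)."""
    return len({len(seq) for seq in sequences}) <= 1


def ungap(sequence: str) -> str:
    """Remove gap characters to recover the unaligned sequence."""
    return sequence.translate(str.maketrans("", "", GAP_CHARS))


if __name__ == "__main__":
    rows = ["AC-GT", "ACCGT", "A..GT"]
    print(is_aligned(rows))
    print([ungap(row) for row in rows])
```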

Sequence identifiers

The NCBI has defined rules for generating unique sequence identifiers (SeqIDs). The following variants of identifiers are allowed in the description line:

Type	Format(s)	Example(s)
Local (does not refer to external databases)	lcl|integer, lcl|string	lcl|123, lcl|hmm271
GenInfo backbone sequence ID	bbs|integer	bbs|123
GenInfo backbone molecule type	bbm|integer	bbm|123
GenInfo import ID	gim|integer	gim|123
GenBank	gb|accession|locus	gb|M73307|AGMA13GT
EMBL	emb|accession|locus	emb|CAM43271.1|
PIR	pir|accession|name	pir||G36364
SWISS-PROT	sp|accession|name	sp|P01013|OVAX_CHICK
Patent	pat|country|patent|sequence number	pat|US|RE33188|1
Patent application	pgp|country|application number|sequence number	pgp|EP|0238993|7
RefSeq	ref|accession|name	ref|NM_010450.1|
General database reference (a database not in this list)	gnl|database|integer, gnl|database|string	gnl|taxon|9606, gnl|PID|e1632
GenInfo integrated database	gi|integer	gi|21434723
DDBJ	dbj|accession|locus	dbj|BAC85684.1|
PRF	prf|accession|name	prf||0806162C
PDB	pdb|entry|chain	pdb|1I4L|D
GenBank with third-party annotation	tpg|accession|name	tpg|BK003456|
EMBL with third-party annotation	tpe|accession|name	tpe|BN000123|
DDBJ with third-party annotation	tpd|accession|name	tpd|FAA00017|
TrEMBL	tr|accession|name	tr|Q90RT2|Q90RT2_9HIV1

The vertical bars (“|”) in the list above are not separators in the table notation but part of the format itself. Several identifiers can be concatenated, separated by vertical bars. If an identifier field is left empty, its delimiters must still be written so that programs can parse the line, which results in two vertical bars in a row (as in pir||G36364).
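Concatenated identifiers of this kind can be taken apart by splitting on the vertical bar and consuming a fixed number of fields per database tag. The sketch below is an illustration under assumptions: the field counts are read off the table above, and the names FIELD_COUNTS and parse_seqid are hypothetical rather than part of any NCBI tool.

```python
# Number of fields that follow each database tag, taken from the table above.
FIELD_COUNTS = {
    "lcl": 1, "bbs": 1, "bbm": 1, "gim": 1, "gi": 1,
    "gb": 2, "emb": 2, "dbj": 2, "pir": 2, "prf": 2, "sp": 2, "tr": 2,
    "ref": 2, "gnl": 2, "pdb": 2, "tpg": 2, "tpe": 2, "tpd": 2,
    "pat": 3, "pgp": 3,
}


def parse_seqid(identifier: str) -> list:
    """Split a bar-delimited identifier into (tag, fields) pairs."""
    tokens = identifier.split("|")
    result = []
    i = 0
    while i < len(tokens):
        tag = tokens[i].lower()
        if tag == "":
            i += 1  # left over from a trailing bar, e.g. "ref|NM_010450.1|"
            continue
        if tag not in FIELD_COUNTS:
            raise ValueError(f"unknown identifier type: {tokens[i]!r}")
        n = FIELD_COUNTS[tag]
        fields = tokens[i + 1:i + 1 + n]
        fields += [""] * (n - len(fields))  # pad fields that were omitted
        result.append((tag, tuple(fields)))
        i += 1 + n
    return result


if __name__ == "__main__":
    print(parse_seqid("gi|31563518|ref|NP_852610.1|"))
    print(parse_seqid("pir||G36364"))
```

Empty fields come back as empty strings, so the two consecutive bars in pir||G36364 yield an empty accession followed by the name.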

File extensions

FASTA files can have different extensions depending on the nature of the biological data they contain.

Extension	Meaning	Notes
fasta	Generic FASTA	Any FASTA data. Sometimes also .fa, .seq, .fsa, .fas
fna	Abbreviation of “FASTA nucleic acid”	Contains nucleotide sequences
ffn	Nucleotide coding regions	Contains the coding regions of a genome
faa	Abbreviation of “FASTA amino acid”	Contains amino acid sequences. The mpfa extension is used when storing multiple proteins in a single file.
frn	Non-coding RNA in FASTA format	Contains non-coding RNAs written in the DNA alphabet, e.g. tRNA, rRNA
afa, mfa	FASTA alignment (a for “alignment”, m for “multiple”)	Contains alignments of biological (nucleotide or amino acid) sequences
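A script that handles mixed input can use the extension as a hint about a file's content. The mapping below is a sketch based only on the table above; the names EXTENSION_CONTENT and guess_content and the category strings are invented for illustration, and an extension is a convention rather than a guarantee of what the file holds.

```python
import os

# Extension-to-content hints taken from the table above.
EXTENSION_CONTENT = {
    ".fasta": "generic", ".fa": "generic", ".seq": "generic",
    ".fsa": "generic", ".fas": "generic",
    ".fna": "nucleotide",
    ".ffn": "nucleotide coding regions",
    ".faa": "amino acid",
    ".mpfa": "amino acid",
    ".frn": "non-coding RNA",
    ".afa": "alignment", ".mfa": "alignment",
}


def guess_content(filename: str) -> str:
    """Return a content hint for a file name, or 'unknown'."""
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSION_CONTENT.get(extension, "unknown")


if __name__ == "__main__":
    for name in ("genome.fna", "proteins.faa", "notes.txt"):
        print(name, guess_content(name))
```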
