FASTA is a textual format for nucleotide or polypeptide sequences in which nucleotides or amino acids are designated using one-letter codes. Due to its simplicity and practicality, it is currently used by most biological sequencing programs. Files of this format can contain sequence names, their identifiers in databases and comments. Depending on the nature of the biological sequences it contains, the FASTA file can have various extensions.
The format was invented by David Lipman and William Pearson in 1985 for the program of the same name, which was designed to search large databases for sequences homologous to a query sequence. They first described the format in the documentation of that program; its description is now part of the documentation of the BLAST program.
Because the FASTA format is so simple, sequences can easily be manipulated with text-editing tools and with programming languages such as Python, Ruby, Perl, and Java.
The FASTA and FASTQ (Sanger Institute) formats are the most popular formats for representing biological sequence data. Other formats also exist, including those used by GenBank, EMBL, and UniProt.
Each FASTA record begins with a one-line description followed by lines containing the sequence itself. The description line is marked by a greater-than symbol (>) in the first column. The word that follows this character, up to the first space, is the sequence identifier; the rest of the line is an optional description. The next few lines may begin with a semicolon (“;”), in which case they are treated as comments. Many databases and programs no longer recognize comments, so they are rarely used. The lines after that contain the biological sequence. For historical reasons, sequence lines are typically limited to 80 to 120 characters, but modern programs also accept a sequence written entirely on a single line. Several sequences can be stored in one file, producing a multi-FASTA file, provided that each sequence has its own identifier. An example of one sequence in FASTA format:
>gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKI
IRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGFIRENE
The identifier for this sequence is gi|31563518|ref|NP_852610.1|.
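The record layout described above can be read with a few lines of Python. The following is a minimal sketch rather than a reference parser: the names `read_fasta` and `SAMPLE` are illustrative, the header is split at the first run of whitespace, and lines starting with a semicolon are simply skipped as comments.

```python
import io
from typing import Iterator, TextIO, Tuple

# Sample record taken from the example above.
SAMPLE = """>gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKI
IRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGFIRENE
"""


def _make_record(header: str, chunks: list) -> Tuple[str, str, str]:
    # The identifier is the first word of the header; the rest is the description.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each record in a FASTA stream."""
    header = None
    chunks = []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:].strip()
            chunks = []
        elif line.startswith(";"):
            continue  # legacy comment line, ignored in this sketch
        elif header is not None:
            chunks.append(line.strip())  # sequence lines are joined into one string
    if header is not None:
        yield _make_record(header, chunks)


for identifier, description, sequence in read_fasta(io.StringIO(SAMPLE)):
    print(identifier, len(sequence))
```

Because the function is a generator, a multi-FASTA file can be processed one record at a time without loading the whole file into memory.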
Sequences are written using the standard IUB/IUPAC one-letter codes for nucleotides or amino acids, in order from the 5′ to the 3′ end for nucleic acids and from the N- to the C-terminus for proteins. Spaces are allowed within sequences, and characters may be upper or lower case. Digits, line terminators, and tabs are ignored by sequence-processing programs.
Nucleic acids are indicated as follows:
Code | Meaning | Mnemonic |
A | A | Adenine |
C | C | Cytosine |
G | G | Guanine |
T | T | Thymine (5-methyluracil) |
U | U | Uracil |
R | A, G | Purines |
Y | C, T, U | Pyrimidines |
K | G, T, U | Keto bases (Ketones) |
M | A, C | Amino bases (aMino) |
S | C, G | Strong interaction (three hydrogen bonds per base pair) |
W | A, T, U | Weak interaction (two hydrogen bonds per base pair) |
B | not A (C, G, T, or U) | B comes after A |
D | not C (A, G, T, or U) | D comes after C |
H | not G (A, C, T, or U) | H comes after G |
V | not T and not U (A, C, or G) | V comes after U |
N | A, C, G, T, or U | Any nucleotide (aNy) |
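One practical use of the ambiguity codes in the table above is matching a degenerate motif against a concrete sequence. The sketch below translates an IUPAC pattern into a regular expression. The names `IUPAC_DNA` and `iupac_to_regex` are illustrative, and the mapping assumes a DNA alphabet in which U is treated as T.

```python
import re

# Each IUPAC code mapped to the concrete DNA bases it stands for
# (assumption for this sketch: U is folded into T).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def iupac_to_regex(pattern: str) -> re.Pattern:
    """Compile an IUPAC nucleotide pattern into an equivalent regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_DNA[code]  # raises KeyError for characters outside the table
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))


motif = iupac_to_regex("GAYTNC")
print(motif.pattern)
print(bool(motif.search("TTGACTGCAA")))
```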
For amino acids, there are 22 common codes (canonical amino acids, selenocysteine and pyrrolysine), 4 special codes (designations for multiple amino acids) and * for a stop codon (in formal gene translations).
Amino acid code | Meaning |
A | Alanine |
B | Aspartic acid (D) or Asparagine (N) |
C | Cysteine |
D | Aspartic acid |
E | Glutamic acid |
F | Phenylalanine |
G | Glycine |
H | Histidine |
I | Isoleucine |
J | Leucine (L) or Isoleucine (I) |
K | Lysine |
L | Leucine |
M | Methionine |
N | Asparagine |
O | Pyrrolysine |
P | Proline |
Q | Glutamine |
R | Arginine |
S | Serine |
T | Threonine |
U | Selenocysteine |
V | Valine |
W | Tryptophan |
Y | Tyrosine |
Z | Glutamic acid (E) or Glutamine (Q) |
X | Any amino acid |
* | Translation termination |
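The four special codes in the table above (B, J, Z, X) mark positions where the residue is not known exactly. As a small illustration, the sketch below lists such positions in a protein sequence. The names `AMBIGUOUS_AA` and `ambiguous_positions` are hypothetical, and positions are reported 1-based by choice.

```python
# Special codes from the table: each stands for several possible residues.
AMBIGUOUS_AA = {
    "B": ("D", "N"),  # aspartic acid or asparagine
    "J": ("L", "I"),  # leucine or isoleucine
    "Z": ("E", "Q"),  # glutamic acid or glutamine
    "X": None,        # any amino acid
}


def ambiguous_positions(sequence: str) -> list:
    """Return (1-based position, code) pairs for every ambiguous residue."""
    return [
        (index, code)
        for index, code in enumerate(sequence.upper(), start=1)
        if code in AMBIGUOUS_AA
    ]


print(ambiguous_positions("MKBZXL*"))
```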
The FASTA format is also used for files containing biological sequence alignments. In this case, gap characters (usually a hyphen or a period) are inserted into each sequence at positions that are absent from that sequence; as a result, all sequences in the file have the same length.
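The equal-length property gives a cheap sanity check for aligned FASTA files. The following sketch assumes the records have already been read into (identifier, sequence) pairs; `is_rectangular` and `ungap` are illustrative names, and only the hyphen and the period are treated as gap characters.

```python
GAP_CHARS = "-."  # gap symbols mentioned above; other tools may use more


def is_rectangular(records) -> bool:
    """True if every aligned sequence has the same length (or there are none)."""
    return len({len(sequence) for _, sequence in records}) <= 1


def ungap(sequence: str) -> str:
    """Remove gap characters to recover the unaligned sequence."""
    return sequence.translate(str.maketrans("", "", GAP_CHARS))


alignment = [("seq1", "ACG-TTA"), ("seq2", "AC..TTA"), ("seq3", "ACGGTTA")]
print(is_rectangular(alignment))
print([ungap(sequence) for _, sequence in alignment])
```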
The NCBI has defined rules for generating unique sequence identifiers (SeqIDs). The following variants of identifiers are allowed in the description line:
Type | Format(s) | Example(s) |
Local (does not refer to external databases) | lcl|integer, lcl|string | lcl|123, lcl|hmm271 |
GenInfo backbone seqid | bbs|integer | bbs|123 |
GenInfo backbone moltype | bbm|integer | bbm|123 |
GenInfo import ID | gim|integer | gim|123 |
GenBank | gb|accession|locus | gb|M73307|AGMA13GT |
EMBL | emb|accession|locus | emb|CAM43271.1| |
PIR | pir|accession|name | pir||G36364 |
SWISS-PROT | sp|accession|name | sp|P01013|OVAX_CHICK |
Patent | pat|country|patent|sequence number | pat|US|RE33188|1 |
Patent application | pgp|country|application number|sequence number | pgp|EP|0238993|7 |
RefSeq | ref|accession|name | ref|NM_010450.1| |
Reference to a database not in this list | gnl|database|integer, gnl|database|string | gnl|taxon|9606, gnl|PID|e1632 |
GenInfo integrated database | gi|integer | gi|21434723 |
DDBJ | dbj|accession|locus | dbj|BAC85684.1| |
PRF | prf|accession|name | prf||0806162C |
PDB | pdb|entry|chain | pdb|1I4L|D |
GenBank with third-party annotations | tpg|accession|name | tpg|BK003456| |
EMBL with third-party annotations | tpe|accession|name | tpe|BN000123| |
DDBJ with third-party annotations | tpd|accession|name | tpd|FAA00017| |
TrEMBL | tr|accession|name | tr|Q90RT2|Q90RT2_9HIV1 |
The vertical bars (“|”) inside the formats and examples above are part of the identifier syntax, not separators between alternatives. Several identifiers can be written one after another, separated by vertical bars. If an identifier field is left empty, the two vertical bars around it must still be written consecutively (as in pir||G36364) to remain compatible with programs.
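To illustrate how concatenated identifiers can be taken apart, the sketch below splits an identifier string on vertical bars and consumes a fixed number of fields per type tag. The field counts are read off the table above. The names `FIELD_COUNTS` and `parse_seqids` are hypothetical, and this is not how any NCBI tool is known to implement it.

```python
# Number of fields that follow each type tag, as listed in the table above.
FIELD_COUNTS = {
    "lcl": 1, "bbs": 1, "bbm": 1, "gim": 1, "gi": 1,
    "gb": 2, "emb": 2, "dbj": 2, "pir": 2, "prf": 2, "sp": 2, "tr": 2,
    "ref": 2, "gnl": 2, "pdb": 2, "tpg": 2, "tpe": 2, "tpd": 2,
    "pat": 3, "pgp": 3,
}


def parse_seqids(identifier: str) -> list:
    """Split a bar-separated identifier into (tag, fields) pairs."""
    tokens = identifier.strip().split("|")
    result = []
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        i += 1
        if not tag:
            continue  # a trailing bar leaves an empty token; skip it
        if tag not in FIELD_COUNTS:
            raise ValueError(f"unknown SeqID type: {tag!r}")
        count = FIELD_COUNTS[tag]
        fields = tokens[i:i + count]
        fields += [""] * (count - len(fields))  # pad omitted trailing fields
        result.append((tag, fields))
        i += count
    return result


print(parse_seqids("gi|31563518|ref|NP_852610.1|"))
print(parse_seqids("pir||G36364"))
```

Empty fields come back as empty strings, so the `pir` example keeps its missing accession in the first position instead of shifting the name into it.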
FASTA files can have different extensions depending on the nature of the biological data they contain.
Extension | Meaning | Notes |
fasta | Generic FASTA | Any FASTA data. Other extensions sometimes used: .fa, .seq, .fsa, .fas |
fna | Short for “FASTA nucleic acid” | Contains nucleotide sequences |
ffn | Nucleotide coding regions | Contains the coding regions of a genome |
faa | Short for “FASTA amino acid” | Contains amino acid sequences. The mpfa extension is used when storing multiple proteins in a single file. |
frn | Non-coding RNA in FASTA format | Contain non-coding RNAs in the DNA alphabet, e.g. tRNA, rRNA |
afa, mfa | FASTA alignment (a for “alignment”, m for “multiple”) | Contain biological (nucleotide or amino acid) sequence alignments |
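A program that accepts arbitrary FASTA files can use the extension as a first hint about the content. The sketch below encodes the table above as a lookup. `EXTENSION_KINDS` and `guess_fasta_kind` are illustrative names, and the extension is only a convention, so the result should be treated as a guess.

```python
from pathlib import Path

# Extension-to-content mapping taken from the table above.
EXTENSION_KINDS = {
    ".fasta": "generic", ".fa": "generic", ".seq": "generic",
    ".fsa": "generic", ".fas": "generic",
    ".fna": "nucleotide",
    ".ffn": "nucleotide coding regions",
    ".faa": "amino acid",
    ".mpfa": "amino acid (multiple proteins)",
    ".frn": "non-coding RNA",
    ".afa": "alignment",
    ".mfa": "alignment",
}


def guess_fasta_kind(path: str) -> str:
    """Guess the content of a FASTA file from its extension (case-insensitive)."""
    return EXTENSION_KINDS.get(Path(path).suffix.lower(), "unknown")


print(guess_fasta_kind("genome.fna"))
print(guess_fasta_kind("proteins.FAA"))
```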