Biomedical Computation

What Is FASTA Format?

FASTA is a text-based format for nucleotide or polypeptide sequences in which nucleotides or amino acids are represented by one-letter codes. Because it is simple and practical, most sequence analysis programs support it. A FASTA file can contain sequence names, database identifiers, and comments. Depending on the kind of biological sequences it contains, a FASTA file can have various extensions.

History

The format was invented by David Lipman and William Pearson in 1985 for the program of the same name, which searches large databases for sequences homologous to a query sequence. They first described the format in the documentation of that program; today its description is also part of the documentation of the BLAST program.

The simplicity of the FASTA format makes it easy to manipulate sequences with text-processing tools and programming languages such as Python, Ruby, Perl, and Java.

FASTA and FASTQ (Sanger Institute) are the most popular formats for representing biological sequence data. Other formats exist as well, including those used by GenBank, EMBL, and UniProt.

Format

A FASTA record begins with a one-line description followed by lines containing the sequence itself. The description line is marked by a greater-than sign (“>”) in the first column. The word after this character, up to the first space, is the sequence identifier; the rest of the line is an optional description. The lines that follow may begin with a semicolon (“;”), in which case they are treated as comments. Many databases and programs do not recognize comments, so they are rarely used. The remaining lines contain the sequence data. For historical reasons, sequence lines are typically limited to 80 to 120 characters, but modern programs also accept a sequence written entirely on a single line. Several records can be stored in one file, which is then called a multi-FASTA file; each sequence must have its own identifier. An example of one sequence in FASTA format:

>gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKI
IRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGFIRENE

The identifier for this sequence is gi|31563518|ref|NP_852610.1|.
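
The record layout described above can be read with a few lines of code. The following is a minimal sketch in Python, assuming that header lines start with “>”, that lines starting with “;” are comments, and that blank lines can be skipped. The function name parse_fasta is hypothetical and not taken from any library.

```python
def parse_fasta(text):
    """Yield (identifier, description, sequence) tuples from FASTA text.

    Assumption for this sketch: '>' starts a header line and ';' starts
    a comment line; everything else is sequence data.
    """
    identifier, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if identifier is not None:
                # a new header closes the previous record
                yield identifier, description, "".join(chunks)
            # the identifier runs up to the first whitespace;
            # the rest of the line is the optional description
            parts = line[1:].split(None, 1)
            identifier = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # sequence lines are concatenated
    if identifier is not None:
        yield identifier, description, "".join(chunks)


# Toy multi-FASTA input (made up for illustration)
example = """>seq1 first toy record
ACGT
ACGT
>seq2
MKV
"""
records = list(parse_fasta(example))
```

Because the sequence lines are simply joined, the sketch treats wrapped and single-line sequences the same way.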

Sequences are written using the standard IUB/IUPAC one-letter codes for nucleotides or amino acids, in order from the 5′ end to the 3′ end for nucleic acids and from the N-terminus to the C-terminus for proteins. Spaces are allowed within a sequence, and letters can be either upper or lower case. Digits, line terminators, and tabs are ignored by sequence-processing programs.

Nucleic acids are indicated as follows:

Code  Meaning           Mnemonic
A     A                 Adenine
C     C                 Cytosine
G     G                 Guanine
T     T                 Thymine (5-methyluracil)
U     U                 Uracil
R     A or G            puRine
Y     C, T, or U        pYrimidine
K     G, T, or U        Keto bases
M     A or C            aMino bases
S     C or G            Strong interaction in a complementary pair (three hydrogen bonds)
W     A, T, or U        Weak interaction in a complementary pair (two hydrogen bonds)
B     C, G, T, or U     not A (B comes after A)
D     A, G, T, or U     not C (D comes after C)
H     A, C, T, or U     not G (H comes after G)
V     A, C, or G        not T and not U (V comes after U)
N     A, C, G, T, or U  aNy nucleotide
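
The ambiguity codes in the table above map naturally onto a lookup table. The sketch below is one possible way to use them in Python; it assumes, for simplicity, that U is treated as equivalent to T, and the names IUPAC_DNA, matches, and count_expansions are hypothetical.

```python
# Ambiguity codes from the table above, expanded over the DNA alphabet.
# Assumption for this sketch: U is folded into T.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(code, base):
    """Return True if a single concrete base is covered by an IUPAC code."""
    return base.upper().replace("U", "T") in IUPAC_DNA[code.upper()]


def count_expansions(sequence):
    """Count how many concrete sequences an ambiguous sequence stands for."""
    total = 1
    for code in sequence:
        total *= len(IUPAC_DNA[code.upper()])
    return total
```

For example, matches("R", "G") is true because R stands for a purine, and count_expansions("RYN") multiplies 2, 2, and 4 possibilities.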

For amino acids, there are 22 single-residue codes (the 20 canonical amino acids plus selenocysteine and pyrrolysine), 4 special codes (each standing for more than one amino acid), and * for translation stop (in formal translations of genes).

Amino acid code Meaning
A Alanine
B Aspartic acid (D) or Asparagine (N)
C Cysteine
D Aspartic acid
E Glutamic acid
F Phenylalanine
G Glycine
H Histidine
I Isoleucine
J Leucine (L) or Isoleucine (I)
K Lysine
L Leucine
M Methionine
N Asparagine
O Pyrrolysine
P Proline
Q Glutamine
R Arginine
S Serine
T Threonine
U Selenocysteine
V Valine
W Tryptophan
Y Tyrosine
Z Glutamic acid (E) or Glutamine (Q)
X Any amino acid
* Translation stop
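
As a small illustration of the special codes in the table above, the following Python sketch strips a trailing * and reports the positions of codes that do not name a single residue. The names AMBIGUOUS_RESIDUES and summarize_protein are hypothetical, and handling only a trailing * is an assumption of this sketch.

```python
# Codes that stand for more than one residue, per the table above.
AMBIGUOUS_RESIDUES = {"B": "DN", "J": "LI", "Z": "EQ"}


def summarize_protein(sequence):
    """Return (residues_without_stop, ambiguous_positions) for a protein string."""
    seq = sequence.upper()
    # Assumption: only a trailing '*' is removed; it marks the end of
    # translation and is not a residue.
    residues = seq[:-1] if seq.endswith("*") else seq
    # Positions are 0-based; X (any amino acid) is reported as well.
    ambiguous = [(i, c) for i, c in enumerate(residues)
                 if c in AMBIGUOUS_RESIDUES or c == "X"]
    return residues, ambiguous
```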

FASTA format is also used for files containing biological sequence alignments. In this case, gap characters (usually a hyphen or a period) are inserted into each sequence at positions where that sequence has nothing aligned to the others. As a result, all sequences in the file must have the same length.
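
The equal-length requirement for alignment files is easy to check programmatically. The sketch below, in Python, assumes that hyphens and periods are the only gap characters; the function name check_alignment is hypothetical.

```python
def check_alignment(sequences, gap_chars="-."):
    """Return (same_length, ungapped) for a list of aligned sequences.

    same_length is True when all sequences share one length;
    ungapped holds each sequence with its gap characters removed.
    """
    lengths = {len(s) for s in sequences}
    ungapped = ["".join(c for c in s if c not in gap_chars) for s in sequences]
    return len(lengths) == 1, ungapped
```

Removing the gap characters recovers the original, unaligned sequences.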

Sequence identifiers

NCBI has defined rules for generating unique sequence identifiers (SeqIDs). The following identifier types are allowed in the description line:

Type                             Format(s)                                       Example(s)
Local (no database reference)    lcl|integer                                     lcl|123
                                 lcl|string                                      lcl|hmm271
GenInfo backbone seqid           bbs|integer                                     bbs|123
GenInfo backbone moltype         bbm|integer                                     bbm|123
GenInfo import ID                gim|integer                                     gim|123
GenBank                          gb|accession|locus                              gb|M73307|AGMA13GT
EMBL                             emb|accession|locus                             emb|CAM43271.1|
PIR                              pir|accession|name                              pir||G36364
SWISS-PROT                       sp|accession|name                               sp|P01013|OVAX_CHICK
Patent                           pat|country|patent|sequence-number              pat|US|RE33188|1
Patent application               pgp|country|application-number|sequence-number  pgp|EP|0238993|7
RefSeq                           ref|accession|name                              ref|NM_010450.1|
General database reference       gnl|database|integer                            gnl|taxon|9606
                                 gnl|database|string                             gnl|PID|e1632
GenInfo integrated database      gi|integer                                      gi|21434723
DDBJ                             dbj|accession|locus                             dbj|BAC85684.1|
PRF                              prf|accession|name                              prf||0806162C
PDB                              pdb|entry|chain                                 pdb|1I4L|D
Third-party annotation, GenBank  tpg|accession|name                              tpg|BK003456|
Third-party annotation, EMBL     tpe|accession|name                              tpe|BN000123|
Third-party annotation, DDBJ     tpd|accession|name                              tpd|FAA00017|
TrEMBL                           tr|accession|name                               tr|Q90RT2|Q90RT2_9HIV1

The vertical bars (“|”) in the list above are not separators between alternatives but part of the format itself. Several identifiers can be written one after another, separated by vertical bars. If an identifier field is left empty, its delimiting bars must still be written for compatibility with programs, so two vertical bars appear in a row, as in pir||G36364.
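
Identifiers built from the types in the table above can be split back into their parts by counting how many fields follow each tag. The following Python sketch does this under the assumption that every tag is one of those listed; the names FIELDS_PER_TAG and split_seqids are hypothetical, and real tools may handle more cases.

```python
# Number of fields that follow each tag, read off the identifier table above.
FIELDS_PER_TAG = {
    "lcl": 1, "bbs": 1, "bbm": 1, "gim": 1, "gi": 1,
    "gb": 2, "emb": 2, "dbj": 2, "pir": 2, "prf": 2, "sp": 2,
    "ref": 2, "tr": 2, "pdb": 2, "gnl": 2, "tpg": 2, "tpe": 2, "tpd": 2,
    "pat": 3, "pgp": 3,
}


def split_seqids(identifier):
    """Split a bar-separated identifier into (tag, fields) pairs."""
    tokens = identifier.split("|")
    result, i = [], 0
    while i < len(tokens):
        tag = tokens[i].lower()
        if tag not in FIELDS_PER_TAG:
            break  # unknown tag or trailing empty token: stop parsing
        n = FIELDS_PER_TAG[tag]
        fields = tokens[i + 1:i + 1 + n]
        fields += [""] * (n - len(fields))  # pad fields that were cut off
        result.append((tag, fields))
        i += 1 + n
    return result
```

With this sketch, the identifier from the earlier example, gi|31563518|ref|NP_852610.1|, splits into a gi part and a ref part, and an empty field shows up as an empty string.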

File extensions

FASTA files can have different extensions depending on the kind of biological data they contain.

Extension  Meaning                                      Notes
fasta      Generic FASTA                                Any FASTA data; sometimes also .fa, .seq, .fsa, .fas
fna        FASTA nucleic acid                           Nucleotide sequences
ffn        Nucleotide coding regions                    Coding regions of genomes
faa        FASTA amino acid                             Amino acid sequences; .mpfa is used when storing multiple proteins in a single file
frn        Non-coding RNA in FASTA format               Non-coding RNAs in the DNA alphabet, e.g. tRNA, rRNA
afa, mfa   FASTA alignment (a: alignment, m: multiple)  Nucleotide or amino acid sequence alignments
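
Since the extension hints at the content, a program can use it as a first guess before reading the file. The Python sketch below encodes the table above as a dictionary; the names EXTENSION_CONTENT and guess_content and the category labels are hypothetical, and the extension is only a convention, so the result should be treated as a hint.

```python
import os

# Mapping taken from the table above; keys are lowercase extensions
# without the leading dot. The category labels are made up for this sketch.
EXTENSION_CONTENT = {
    "fasta": "generic", "fa": "generic", "seq": "generic",
    "fsa": "generic", "fas": "generic",
    "fna": "nucleotide",
    "ffn": "coding regions",
    "faa": "amino acid", "mpfa": "amino acid",
    "frn": "non-coding RNA",
    "afa": "alignment", "mfa": "alignment",
}


def guess_content(filename):
    """Guess the kind of FASTA data from the file extension alone."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return EXTENSION_CONTENT.get(ext, "unknown")
```

Compressed files such as genome.fna.gz are reported as unknown by this sketch, because only the last extension is examined.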