Biomedical Computation

Gene Ontology

Gene Ontology (GO) is a bioinformatics project dedicated to the creation of a unified terminology for annotating genes and gene products of all biological species.

The goals of the project are to maintain and extend a controlled list of attributes of genes and their products, to compile annotations of genes and gene products, and to develop tools for working with the project database and for analyzing new experimental data, in particular the representation of functional groups of genes. The GO project has also created a markup language for classifying data (information about genes and their products, that is, RNA and proteins, and about their functions), which makes it possible to find systematized information about gene products quickly.
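The analysis of the representation of functional groups of genes is often carried out as an over-representation test. The following is a minimal sketch, assuming a hypergeometric model (a common choice, not necessarily the one used by any particular GO tool). The function name, its parameters, and the numbers in the usage lines are illustrative and do not come from the GO project.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes in a random sample.

    population: number of genes in the background set
    annotated:  background genes annotated with the term of interest
    sample:     number of genes in the studied list
    hits:       genes in the studied list annotated with the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the hypergeometric probabilities of k = hits .. upper annotated genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Invented numbers, for illustration only.
p = enrichment_p_value(population=20_000, annotated=200, sample=100, hits=10)
print(f"p-value: {p:.3g}")
```

A small p-value suggests that the term is represented in the studied list more often than expected by chance. In practice such tests are run for many terms at once, so a multiple-testing correction is normally applied on top of this.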

Gene Ontology is part of a larger classification project, Open Biomedical Ontologies (OBO).

History and current status

In computer science, ontologies are used to formalize a domain of knowledge as a system of data about real-world objects and the connections between them (a so-called knowledge base). Biology and related disciplines lacked a universal standard for terminology. Terms that express similar concepts can have fundamentally different meanings when they are used for different biological species, in different areas of research, or even by different groups of scientists, which complicates the exchange of data. The task of the Gene Ontology project was therefore to create an ontology of terms that reflect the properties of genes and their products and are applicable to any organism.

Gene Ontology was created in 1998 by a consortium of scientists who studied the genomes of three model organisms: Drosophila melanogaster (fruit fly), Mus musculus (mouse), and Saccharomyces cerevisiae (baker’s yeast). Databases for many other model organisms later joined the GO Consortium, contributing not only to the expansion of the annotation base but also to the creation of services for viewing and using the data.

The GO Consortium (GOC) is the set of biological databases and research groups that actively participate in the Gene Ontology project. It includes several databases for various model organisms, general protein databases, software development teams, and Gene Ontology editors.

Gene Ontology is a large-scale and rapidly developing project. As of September 2011, Gene Ontology contained more than 33 thousand terms and about 12 million annotations of gene products, covering more than 360 thousand organisms. By the end of 2016, the number of terms exceeded 44 thousand, and the number of organisms annotated in the knowledge base exceeded 460 thousand.

Over the past few years, the GO Consortium has implemented a number of changes to the ontology to increase the quantity, quality, and specificity of GO annotations. By 2013, the number of annotations exceeded 96 million. The quality of annotations was improved through automated quality checks, and new terms were added. In 2007, a new service, InterMine, was created to integrate genomic data from a large number of disparate sources and to facilitate computational tasks such as finding specific genomic regions and performing statistical tests. The service was originally created to integrate data for Drosophila, but it now covers a large number of model organisms. In recent years, the LEGO service (Linked Expressions using the Gene Ontology) has been under development; it makes it possible to study how annotations in the GO database interact by combining them into more general models of genes and their functions.

Structure and terms

Gene Ontology describes complex biological phenomena rather than specific biological objects. The Gene Ontology database includes three independent dictionaries:

  • Molecular function – classification according to the specific function of a gene product (protein or RNA) at the molecular level, for example, carbohydrate binding or ATPase activity;
  • Biological process – classification according to a complex process that is usually necessary for the vital activity of an organism and that results from a sequence of molecular reactions, for example, mitosis or purine biosynthesis;
  • Cellular component – classification according to the part of the cell or extracellular space where the function of the gene product is carried out, for example, the nucleus or the ribosome.

Each term in Gene Ontology has a number of attributes: a unique numeric identifier, a name, the dictionary to which the term belongs, and a definition. A term can have synonyms, which are classified as exact (matching the meaning of the term), broader, narrower, or related. Optional attributes include links to sources and to other databases, as well as comments on the meaning and use of the term.
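The attributes described above can be pictured as a simple record. The following is a minimal sketch, not the schema used by the GO project: the class names, field names, and sample values (including the placeholder definition text) are illustrative assumptions. It requires Python 3.9 or later.

```python
from dataclasses import dataclass, field
from enum import Enum


class SynonymScope(Enum):
    """The four kinds of synonym described in the text."""
    EXACT = "exact"
    BROAD = "broad"
    NARROW = "narrow"
    RELATED = "related"


@dataclass(frozen=True)
class Synonym:
    text: str
    scope: SynonymScope


@dataclass
class GOTerm:
    # Required attributes.
    term_id: str        # unique identifier
    name: str
    dictionary: str     # which of the three dictionaries the term belongs to
    definition: str
    # Optional attributes.
    synonyms: list[Synonym] = field(default_factory=list)
    xrefs: list[str] = field(default_factory=list)   # links to sources and other databases
    comment: str = ""


def synonyms_by_scope(term: GOTerm, scope: SynonymScope) -> list[Synonym]:
    """Return the synonyms of `term` that have the given scope."""
    return [s for s in term.synonyms if s.scope is scope]


# Illustrative values only; the definition is a placeholder, not the database text.
term = GOTerm(
    term_id="GO:0005634",
    name="nucleus",
    dictionary="cellular_component",
    definition="(placeholder definition text)",
    synonyms=[Synonym("cell nucleus", SynonymScope.EXACT)],
)
print([s.text for s in synonyms_by_scope(term, SynonymScope.EXACT)])
```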

The ontology is structured as a directed acyclic graph: each term is linked to one or more other terms through relationships of various types.

The types of relations:

  • “A is a B” – A is a special case (subtype) of B;
  • “A part of B” – A is a component of B;
  • “B has part A” – B includes A as a component;
  • “A regulates B” – A affects the course of B;
  • “A positively regulates B” – A activates or enhances B;
  • “A negatively regulates B” – A inhibits or reduces B;
  • “A occurs in B” – A takes place in B.
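The graph structure and the relation types listed above can be sketched as follows. This is a minimal illustration, assuming a hand-written toy graph: the term labels and edges are hypothetical and are not taken from the GO database, and the function names are invented for this example.

```python
from collections import defaultdict

# Hypothetical toy edges in the form (child, relation, parent).
EDGES = [
    ("nuclear membrane", "part_of", "nucleus"),
    ("nucleus", "is_a", "organelle"),
    ("nucleus", "part_of", "cell"),
    ("organelle", "is_a", "cellular component"),
    ("cell", "is_a", "cellular component"),
]


def build_graph(edges):
    """Map each term to the list of (relation, parent) pairs leaving it."""
    graph = defaultdict(list)
    for child, relation, parent in edges:
        graph[child].append((relation, parent))
    return graph


def ancestors(term, graph, relations=("is_a", "part_of")):
    """Collect every term reachable from `term` through the given relation types."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in graph.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


graph = build_graph(EDGES)
print(sorted(ancestors("nuclear membrane", graph)))
print(sorted(ancestors("nucleus", graph, relations=("is_a",))))
```

Because a term may have several parents, the traversal keeps a set of visited terms so that a shared ancestor (here, "cellular component") is reported only once. Restricting `relations` shows how the result depends on which relation types are followed.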

Changes and additions to the Gene Ontology database are made continually, both by the curators of the GO project and by other researchers. Amendments proposed by users are reviewed by the project editors and applied if approved.

A file containing the entire database can be downloaded in various formats from the official Gene Ontology website, and the terms can be browsed online with the AmiGO browser. The browser can also be used to extract the set of gene products associated with a particular term. In addition, mappings of GO terms to other classification systems can be downloaded from the site.
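One of the formats in which the ontology is distributed is the OBO flat file, in which each term is a `[Term]` stanza of `tag: value` lines. The following is a minimal sketch of reading such stanzas, assuming a small hand-written sample: the `EX:` identifiers and names are hypothetical, and the parser deliberately ignores most features of the real format (escaping, trailing modifiers, other stanza types), for which a dedicated library would be the better choice.

```python
# Hypothetical sample in the style of an OBO file; not real GO content.
SAMPLE = """\
format-version: 1.2

[Term]
id: EX:0000001
name: example parent term
namespace: cellular_component

[Term]
id: EX:0000002
name: example child term
namespace: cellular_component
is_a: EX:0000001 ! example parent term
relationship: part_of EX:0000001 ! example parent term
"""


def parse_obo_terms(text):
    """Return {term id: {tag: [values]}} for every [Term] stanza in `text`."""
    terms = {}
    current = None  # tag dictionary of the stanza being read, if it is a [Term]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # blank line or comment
        if line.startswith("["):
            current = {} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header line or a stanza type this sketch does not handle
        tag, _, value = line.partition(":")
        value = value.split(" ! ")[0].strip()  # drop the trailing "! comment"
        current.setdefault(tag.strip(), []).append(value)
        if tag.strip() == "id":
            terms[value] = current
    return terms


terms = parse_obo_terms(SAMPLE)
for term_id, tags in terms.items():
    print(term_id, tags["name"][0], tags.get("is_a", []))
```

Values are kept as lists because tags such as `is_a` can occur more than once in a stanza.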

Annotations

Genome annotation aims to obtain information about the properties of gene products. GO annotations use Gene Ontology terms for this purpose. GO Consortium members publish their annotations on the Gene Ontology website, where they are available for direct download or for viewing in the AmiGO browser.

A gene annotation contains the following data: the name and identifier of the gene product; the associated GO term; the type of data on which the annotation is based (the evidence code); a link to the source; and the creator and date of the annotation. The evidence codes are described by a dedicated ontology within the OBO project. It covers various methods of annotation, both manual and automatic. For example:

  • IDA (Inferred from Direct Assay) – based on a direct experimental assay;
  • TAS (Traceable Author Statement) – based on an author's statement that can be traced to its source in a scientific publication;
  • IMP (Inferred from Mutant Phenotype) – based on a mutant phenotype;
  • IGI (Inferred from Genetic Interaction) – based on genetic interaction;
  • IPI (Inferred from Physical Interaction) – based on physical interaction;
  • RCA (Inferred from Reviewed Computational Analysis) – based on a reviewed computational analysis;
  • ISS (Inferred from Sequence Similarity) – based on sequence similarity;
  • IGC (Inferred from Genomic Context) – based on the genomic context;
  • IEP (Inferred from Expression Pattern) – based on the expression pattern;
  • NAS (Non-traceable Author Statement) – based on an author's statement that cannot be traced to its source;
  • IEA (Inferred from Electronic Annotation) – based on automatic extraction of annotations from other databases;
  • IC (Inferred by Curator) – assigned by a curator;
  • ND (No biological Data available) – no biological data are available.

As of September 2012, more than 99% of all Gene Ontology annotations had been obtained automatically. Since such annotations are not checked manually, the GO Consortium considers them less reliable, and only some of them are available in the AmiGO browser. The complete set of annotations can be downloaded from the Gene Ontology website.
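The annotation fields and evidence codes described in this section can be sketched as a record type together with a filter that separates automatic annotations from the rest. This is a minimal sketch under stated assumptions: the class and field names are invented, the sample records are hypothetical, and treating IEA as the only automatic code is a simplification based on the list above.

```python
from dataclasses import dataclass

# Evidence codes as listed in this section.
EVIDENCE_CODES = {
    "IDA": "Inferred from Direct Assay",
    "TAS": "Traceable Author Statement",
    "IMP": "Inferred from Mutant Phenotype",
    "IGI": "Inferred from Genetic Interaction",
    "IPI": "Inferred from Physical Interaction",
    "RCA": "Inferred from Reviewed Computational Analysis",
    "ISS": "Inferred from Sequence Similarity",
    "IGC": "Inferred from Genomic Context",
    "IEP": "Inferred from Expression Pattern",
    "NAS": "Non-traceable Author Statement",
    "IEA": "Inferred from Electronic Annotation",
    "IC": "Inferred by Curator",
    "ND": "No biological Data available",
}

# Assumption for this sketch: only IEA counts as an automatic annotation.
AUTOMATIC_CODES = frozenset({"IEA"})


@dataclass(frozen=True)
class Annotation:
    gene_product_id: str
    gene_product_name: str
    go_term: str
    evidence_code: str
    reference: str      # link to the source
    assigned_by: str    # creator of the annotation
    date: str

    def __post_init__(self):
        if self.evidence_code not in EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {self.evidence_code}")


def curated_only(annotations):
    """Keep the annotations whose evidence code is not automatic."""
    return [a for a in annotations if a.evidence_code not in AUTOMATIC_CODES]


def automatic_share(annotations):
    """Fraction of annotations that carry an automatic evidence code."""
    automatic = sum(a.evidence_code in AUTOMATIC_CODES for a in annotations)
    return automatic / len(annotations)


# Hypothetical records, for illustration only.
SAMPLE = [
    Annotation("DB:0001", "geneA", "EX:0000001", "IEA", "REF:1", "ExampleDB", "2012-01-01"),
    Annotation("DB:0002", "geneB", "EX:0000001", "IEA", "REF:1", "ExampleDB", "2012-01-01"),
    Annotation("DB:0003", "geneC", "EX:0000002", "IDA", "REF:2", "ExampleDB", "2012-02-01"),
    Annotation("DB:0004", "geneD", "EX:0000002", "IEA", "REF:1", "ExampleDB", "2012-03-01"),
]
print(automatic_share(SAMPLE), [a.gene_product_name for a in curated_only(SAMPLE)])
```

Filtering on the evidence code in this way lets an analysis decide explicitly whether to include the less reliable automatic annotations.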