Mission Statement
Gene Ontology (GO) annotation is a common problem in bioinformatics in which genes must be associated with their properties in living organisms, with those properties represented by GO terms. In its most common form, GO annotation is the problem of selecting a protein's properties based on its sequence. This maps naturally onto a machine learning problem of learning function from sequence, and also onto a bioinformatics problem of associating proteins with similar functions.
Current work in GO annotation is largely based on the Critical Assessment of protein Function Annotation algorithms (CAFA) competitions. In these competitions, a large set of proteins with unknown function is reserved, and hundreds of teams compete to make predictions on them. After several months, many of these proteins are experimentally annotated, and competitors are evaluated on the accuracy of their predictions. The CAFA competition works extremely well for evaluation, since it's nearly impossible to cheat, and teams can safely use any available source of data. However, its downside is that it requires more than a year between evaluations.
Currently, it's typical for new GO annotation papers to recreate the CAFA dataset using the same methodology so that they can evaluate against previous models. However, this is labor-intensive and inexact. Each paper may have its own methodology and provide subtly different comparisons, issues that are compounded by the fact that different models may require differently constructed datasets.
There is no perfect GO annotation dataset. However, this web application offers a set of several customizable datasets designed to be highly compatible, so that GO annotation models can be efficiently trained and compared at any time. Dataset format and sources are very similar to those of CAFA, and will be familiar to any participant in the competition.
GO Pipeline
GO Bench is oriented towards ML applications in which models are trained using a large training dataset, and then tuned and tested using validation and testing sets. By convention, it generates a train/validation/test split of 0.7/0.15/0.15.
Based on the options selected on the dataset form page, a large set of protein annotations is extracted from the GOA database, filtered, and postprocessed. The resulting tab-separated file can be used directly for training or evaluating models.
GO Annotation is treated as a multi-class, multi-label learning problem, in which proteins are associated with any number of GO terms, so a dataset for the model is simply a set of Uniprot identifiers, with each identifier associated with some number of GO IDs. FASTA files containing protein sequences for each identifier are available from UniProt Swiss-Prot.
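As an illustration, the sketch below pairs GO Bench label rows with Swiss-Prot sequences using pandas and Biopython; the file names ("train.tsv", "uniprot_sprot.fasta") are placeholders rather than the exact names produced by the form.

```python
from Bio import SeqIO
import pandas as pd

# Placeholder file names: a GO Bench label file and a Swiss-Prot FASTA download.
labels = pd.read_csv("train.tsv", sep="\t", header=None, names=["uniprot_id", "go_ids"])
label_map = {row.uniprot_id: row.go_ids.split(",") for row in labels.itertuples()}

# Swiss-Prot FASTA headers look like "sp|ACCESSION|ENTRY_NAME ..."; the accession is field 1.
sequences = {}
for record in SeqIO.parse("uniprot_sprot.fasta", "fasta"):
    accession = record.id.split("|")[1]
    if accession in label_map:
        sequences[accession] = str(record.seq)

# Each training example pairs a protein sequence with its set of GO IDs.
examples = [(sequences[p], gos) for p, gos in label_map.items() if p in sequences]
```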
Dataset Form Configuration
Datasets are configurable based on evidence codes, annotation propagation, date of annotation, namespace, and split method.
Evidence Codes
Evidence codes represent the source of a given protein annotation, and are included in the GOA database. A quick guide to evidence codes can be found here: GO Evidence Codes. Each evidence code is represented by a short acronym in the dataset form, and can be selected to include annotations with that evidence code in the downloadable dataset. For instance, the EXP code indicates that an annotation is based on direct experimental evidence and is highly trustworthy, so we select the EXP evidence code by default.
In contrast, the IEA code indicates that an annotation is based entirely on computed heuristics. IEA coded annotations are very common, but relatively untrustworthy, so these are left out of our default selection.
For convenience, we organize evidence codes into three groups based on level of trustworthiness. These can be selected as default options in our form, and can be roughly described as "experimentally based annotations", "human reviewed annotations", and "all annotations". Based on our own experiments, most deep learning models get the highest performance using "human reviewed annotations".
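As a rough sketch of how such groups behave, the snippet below filters a raw annotation table by evidence code; the column name and the exact code sets are illustrative and may not match the groupings used by the form.

```python
import pandas as pd

# Illustrative groupings; the exact sets behind the form's presets may differ.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
HUMAN_REVIEWED = EXPERIMENTAL | {"TAS", "IC", "ISS", "ISO", "ISA", "ISM"}

# Hypothetical GOA export with an 'evidence_code' column.
goa = pd.read_csv("goa_annotations.tsv", sep="\t")
reviewed = goa[goa["evidence_code"].isin(HUMAN_REVIEWED)]
```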
Split Method
The split method determines how your dataset is split into training, validation, and testing sets.
Our recommended setting is cluster50, in which proteins with over 50% similarity are always grouped within the same dataset. This is standard for most protein classification problems, and makes the problem setting more challenging.
Random is the alternative setting, in which proteins are divided blindly. A major benefit of random is that, because splits are made truly randomly, class counts are more evenly distributed between training, validation, and testing.
The cluster50 and random splits are pre-generated, and consistent between dataset requests. Other settings, such as annotation quality, may change the number of proteins in a dataset, but only by filtering training, validation, and testing separately.
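To illustrate the idea behind cluster50, here is a minimal sketch that assigns whole clusters to splits, assuming a precomputed protein-to-cluster mapping (for example, from a 50%-identity clustering tool). This is the general technique, not the pre-generated split shipped by GO Bench.

```python
import random

def cluster_split(protein_to_cluster, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole clusters to train/valid/test so that similar proteins
    never cross split boundaries."""
    # Group proteins by cluster ID.
    clusters = {}
    for protein, cluster_id in protein_to_cluster.items():
        clusters.setdefault(cluster_id, []).append(protein)

    # Shuffle clusters, then deal them out by the requested fractions.
    # (Fractions are applied to cluster counts here, a simplification.)
    cluster_ids = list(clusters)
    random.Random(seed).shuffle(cluster_ids)
    n_train = int(fractions[0] * len(cluster_ids))
    n_valid = int(fractions[1] * len(cluster_ids))

    splits = {"train": [], "valid": [], "test": []}
    for i, cluster_id in enumerate(cluster_ids):
        if i < n_train:
            split = "train"
        elif i < n_train + n_valid:
            split = "valid"
        else:
            split = "test"
        splits[split].extend(clusters[cluster_id])
    return splits
```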
Date Filtering
We allow GO annotations to be filtered by date, largely so that users can build datasets compatible with previous CAFA competitions.
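For example, assuming an annotation table with a parseable date column, a CAFA-style cutoff could be applied as follows (the column name and cutoff date are illustrative):

```python
import pandas as pd

# Hypothetical export with a 'date' column recording when each annotation was made.
goa = pd.read_csv("goa_annotations.tsv", sep="\t", parse_dates=["date"])

# Keep only annotations made on or before an illustrative cutoff date.
cutoff = pd.Timestamp("2020-01-01")
goa_before_cutoff = goa[goa["date"] <= cutoff]
```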
Namespace
Gene Ontology terms are divided into three distinct categories: Molecular Function, Biological Process, and Cellular Component. These represent different aspects of gene properties, and are almost always modeled separately. GO Bench outputs separate datasets for each of these namespaces.
Propagate Annotations
Our Gene Ontology dataset maps proteins to their associated GO terms, but only stores the most specific term associated with a given protein. For instance, we might store an association between the protein Q13219 and the GO term "hexose biosynthetic process", but we wouldn't explicitly store an association between Q13219 and the more general term "metabolic process".
The propagate annotations option augments our dataset with these implicit associations by propagating known annotations up the Gene Ontology tree. This greatly expands the number of annotations provided, but is also somewhat redundant.
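A minimal sketch of this kind of propagation, assuming the goatools package and a local copy of go-basic.obo:

```python
from goatools.obo_parser import GODag

# Load the Gene Ontology DAG (assumes go-basic.obo has been downloaded locally).
go_dag = GODag("go-basic.obo")

def propagate(go_ids):
    """Expand a set of specific GO IDs with all of their ancestor terms."""
    expanded = set(go_ids)
    for go_id in go_ids:
        if go_id in go_dag:
            # get_all_parents() walks every parent path up to the root term.
            expanded |= go_dag[go_id].get_all_parents()
    return expanded

# e.g. propagate({"GO:0019319"}) (hexose biosynthetic process) adds ancestors
# such as "GO:0008152" (metabolic process).
```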
Files
GO Bench returns a gzipped tar file containing training, validation, and testing data for all selected namespaces. The output can be unzipped with "tar -xzvf GO_benchmark_data.tar.gz", as shown in our IPython notebook tutorial. Once unzipped, it contains the files described below.
.tsv files within the dataset contain two tab-separated columns, one holding UniProt protein IDs and the other listing the GO IDs associated with each protein, separated by commas.
.json files list the GO terms found in the training, validation, and testing sets. For completeness, testing sets always contain every GO ID associated with the given protein IDs, but training and validation sets may have rare terms filtered out to improve training.
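A small sketch of loading one namespace's files into a multi-label matrix, assuming pandas and scikit-learn; the paths and the exact .json layout are illustrative.

```python
import json

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative paths; actual names depend on the namespaces selected in the form.
train = pd.read_csv("molecular_function/train.tsv", sep="\t", header=None,
                    names=["uniprot_id", "go_ids"])
with open("molecular_function/terms.json") as f:
    terms = json.load(f)  # assumed to be a list of GO IDs for this namespace

# Fix the label columns to the published term vocabulary so that
# train/validation/test matrices line up column-for-column.
mlb = MultiLabelBinarizer(classes=sorted(terms))
y_train = mlb.fit_transform(train["go_ids"].str.split(","))  # shape: (n_proteins, n_terms)
```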
Benchmark Compatibility
Training and validation sets returned by our form are designed to avoid any overlap with the datasets from "Benchmarking Gene Ontology Function Predictions Using Negative Annotations" (Vesztrocy and Dessimoz, 2020). Datasets may also be made compliant with the CAFA Challenge by filtering to the appropriate maximum annotation date.
Model Comparison
Models trained on GO Bench data can also be evaluated by submitting predictions to our leaderboard. Prediction files should follow a format similar to the CAFA4 competition, with three tab-separated columns.
No header is required, and each row should contain a prediction of the form `{Uniprot ID} {GO ID} {Probability}`, where probability is a confidence indicator ranging from 0 to 1. Probabilities below 0.01 should be excluded.
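For instance, a dictionary of model scores could be written out in this format as follows (the identifiers and confidence values are illustrative):

```python
# Maps (UniProt ID, GO ID) pairs to model confidence scores; values are illustrative.
predictions = {
    ("Q13219", "GO:0019319"): 0.87,
    ("Q13219", "GO:0008152"): 0.45,
}

with open("model_predictions.tsv", "w") as f:
    for (uniprot_id, go_id), probability in predictions.items():
        if probability >= 0.01:  # predictions below 0.01 should be excluded
            f.write(f"{uniprot_id}\t{go_id}\t{probability:.2f}\n")
```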
After upload, you can view your results on the model visualizations panel, which generates precision-recall, S-min, and F-Score curves for the model predictions in their given category. Our convention is to record results for testing data generated with Experimental-quality annotation codes, which minimizes the chance of false positives in the evaluation set but carries a higher chance of false negatives.
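As a rough reference for how the F-Score curve is summarized, here is a sketch of a CAFA-style protein-centric Fmax computation over binary label matrices; the leaderboard's exact implementation may differ.

```python
import numpy as np

def fmax(y_true, y_score, thresholds=np.linspace(0.01, 1.0, 100)):
    """Protein-centric Fmax: the maximum F1 over confidence thresholds, with
    precision averaged over proteins that have predictions above the threshold
    and recall averaged over proteins that have true annotations."""
    best = 0.0
    for t in thresholds:
        y_pred = y_score >= t
        precisions, recalls = [], []
        for true_row, pred_row in zip(y_true, y_pred):
            tp = np.logical_and(true_row, pred_row).sum()
            if pred_row.sum() > 0:
                precisions.append(tp / pred_row.sum())
            if true_row.sum() > 0:
                recalls.append(tp / true_row.sum())
        if not precisions or not recalls:
            continue
        precision, recall = np.mean(precisions), np.mean(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```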