Universität Bonn

Protein Function Prediction

In the postgenomic era, the majority of new proteins can only be annotated with computational methods. Our tool AHRD automatically annotates proteins with human-readable descriptions and Gene Ontology (GO) terms on a genomic scale.

AHRD

Sequence similarity, computed for example by BLAST, is used for large-scale annotation of protein sequences. These automated annotations propagate through non-curated protein databases as human-readable descriptions or Gene Ontology annotations. Simply transferring the annotation of the most similar database match therefore propagates errors. To overcome this, we developed AHRD (Automatic assignment of Human Readable Descriptions). It is modeled on the workflow of human curators evaluating similarity search results. We optimized AHRD with heuristic machine learning algorithms, based on the semantic similarity of GO subgraphs. AHRD can overcome problems caused by wrong annotations, a lack of similar sequences and partial alignments.

Prediction of Human Readable Descriptions

AHRD first removes descriptions that contain indicators of a previous annotation transfer, because a description should be as close to the primary source as possible. The remaining descriptions are then split into their words, from here on referred to as tokens. Tokens that are common to all kinds of protein descriptions and generally considered uninformative are ignored. All other tokens are scored by their abundance in descriptions of proteins that have a high bit score in the search results, a good alignment overlap and an origin in a trusted database. Finally, the description candidates are ranked according to their tokens, and the top-ranked description is transferred to the query.
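The steps above can be sketched in a few lines of Python. This is a minimal illustration, not AHRD's actual implementation (AHRD itself is written in Java): the `Hit` class, the regular expression for transfer indicators, the list of uninformative tokens and the equal weighting of bit score, overlap and database trust are all assumptions chosen for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Hit:
    """One similarity search hit (hypothetical structure for this sketch)."""
    description: str
    bit_score: float
    overlap: float    # alignment overlap with the query, in [0, 1]
    db_weight: float  # trust in the source database, in [0, 1]


# Assumed examples of phrases that hint at a previous annotation transfer.
TRANSFER_INDICATORS = re.compile(r"\b(similar to|homolog of|by similarity)\b", re.I)
# Assumed examples of tokens that carry no functional information.
UNINFORMATIVE = {"protein", "family", "domain", "containing", "like", "of", "the"}


def tokenize(description):
    """Split a description into lowercase tokens, dropping uninformative ones."""
    words = re.split(r"[^a-z0-9]+", description.lower())
    return {w for w in words if w and w not in UNINFORMATIVE}


def hit_weight(hit, max_bit_score):
    """Combine relative bit score, overlap and database trust (equal weights assumed)."""
    return (hit.bit_score / max_bit_score + hit.overlap + hit.db_weight) / 3.0


def best_description(hits):
    """Return the top-ranked description, or None if no candidate remains."""
    candidates = [h for h in hits if not TRANSFER_INDICATORS.search(h.description)]
    if not candidates:
        return None
    max_bits = max(h.bit_score for h in candidates) or 1.0

    # A token scores higher the more often it occurs in well-supported hits.
    token_scores = {}
    for h in candidates:
        weight = hit_weight(h, max_bits)
        for token in tokenize(h.description):
            token_scores[token] = token_scores.get(token, 0.0) + weight

    def description_score(hit):
        tokens = tokenize(hit.description)
        if not tokens:
            return 0.0
        return sum(token_scores[t] for t in tokens) / len(tokens)

    return max(candidates, key=description_score).description
```

Averaging the token scores keeps long descriptions from winning merely by containing more tokens. This is one possible design choice for the sketch and not necessarily the ranking AHRD uses.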

Prediction of Gene Ontology Terms

Previous versions of AHRD transferred GO annotations from reference proteins that were scored solely on the characteristics of their human-readable descriptions. To increase AHRD’s GO term annotation performance, we implemented a candidate protein scoring procedure based directly on GO annotations. For the prediction of GO terms, it is just as important to avoid electronically transferred protein annotations as it is for the description prediction. We therefore subject the candidate reference proteins for GO term annotation to the same filtering steps that are performed on the description candidates. In this way, AHRD’s GO term annotation procedure also benefits from quality indicators derived from human-readable descriptions. GO terms are then scored according to how often they occur in the annotations of proteins that meet the following criteria: they come from curated databases, have a high search score with respect to the query, have a good alignment overlap with the query and are annotated with experimental evidence codes. The GO annotation of the candidate protein with the highest score is then transferred to the query.
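The GO term scoring can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not AHRD's actual code: the `Candidate` class, the equal weighting of the four criteria and the averaging of term scores per protein are hypothetical choices, and the candidates are assumed to have already passed the description-based filtering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A reference protein that passed the description filters (hypothetical structure)."""
    accession: str
    go_terms: frozenset
    bit_score: float
    overlap: float      # alignment overlap with the query, in [0, 1]
    curated: bool       # comes from a curated database
    experimental: bool  # annotated with experimental evidence codes


def candidate_weight(candidate, max_bit_score):
    """Combine the four quality criteria with equal weights (an assumption)."""
    return (
        candidate.bit_score / max_bit_score
        + candidate.overlap
        + float(candidate.curated)
        + float(candidate.experimental)
    ) / 4.0


def predict_go_terms(candidates):
    """Return the GO terms of the candidate with the best-scoring annotation."""
    annotated = [c for c in candidates if c.go_terms]
    if not annotated:
        return frozenset()
    max_bits = max(c.bit_score for c in annotated) or 1.0

    # A GO term scores higher the more often well-supported proteins carry it.
    term_scores = {}
    for c in annotated:
        weight = candidate_weight(c, max_bits)
        for term in c.go_terms:
            term_scores[term] = term_scores.get(term, 0.0) + weight

    def annotation_score(candidate):
        terms = candidate.go_terms
        return sum(term_scores[t] for t in terms) / len(terms)

    return max(annotated, key=annotation_score).go_terms
```

Note that the complete GO annotation of one candidate protein is transferred, rather than a mix of the individually best terms, which matches the procedure described above.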


AHRD in the CAFA Challenge

The CAFA (Critical Assessment of Functional Annotation) challenge is a recurring community-wide contest to test competing computational protein function prediction tools. We participated in CAFA3 and CAFA-π (https://doi.org/10.1186/s13059-019-1835-8) in 2017 and in CAFA4 in 2019.


AHRD Development

AHRD is developed in Java and version controlled with Git. It is built with Ant and freely available on GitHub. Because AHRD is a command-line application, it can be integrated into existing workflows to facilitate automation.

Using Conda and Snakemake, we created a workflow wrapper for AHRD called AHRD_Snakemake. It simplifies the annotation of a query FASTA file by taking care of all necessary downloads, sequence similarity searches and parameter configuration, while also handling the required software dependencies. AHRD_Snakemake is also freely available on GitHub.

