What is HAMAP?
General
HAMAP stands for High-quality Automated and Manual Annotation of Proteins.
Next-generation sequencing and the ever-increasing rate of complete genome sequencing now produce so much data that it is no longer possible to manually annotate even a small portion of these genomes, despite the considerable demand for corrected and annotated complete proteome sets. To enrich the annotation of these proteins in UniProtKB, we developed HAMAP, whose goal is to automatically annotate a significant percentage of the large number of proteins originating from complete genome sequencing projects. This automatic annotation pipeline, based on a collection of family profiles and manually created annotation rules, is applied only where it can produce the same quality as manual annotation would, that is, to proteins that are part of well-defined families or subfamilies. By this we mean protein families that have a well-defined function and are well conserved at the sequence level. HAMAP was originally developed for the annotation of proteins from completely sequenced bacteria, archaea and plastids, but we now also produce and integrate HAMAP rules and profiles targeting eukaryotic and viral protein families.
HAMAP family profiles
HAMAP family profiles are manually curated signatures used to determine the protein family membership of query protein sequences. Initial membership of a family is assigned on the basis of sequence similarity and what is known in the literature about the protein in question. These "seed members" are manually chosen and aligned. The resulting "seed alignment" is then used to automatically generate a HAMAP family profile (for more details see our "Standard operating procedure (SOP) for HAMAP family profiles creation" document).
Large and complex paralogous families (e.g. ABC transporters) sometimes require a somewhat different approach. For these, stringent profiles are required to distinguish between functional subfamilies that differ in the transported substrate. For ABC transporters, manually built PROSITE profiles are used to assign family membership and there are no seed alignments.
HAMAP annotation rules
The manually created HAMAP annotation rules specify annotations and the conditions under which they may be applied. To create each HAMAP annotation rule, the available literature is consulted and all proteins for which there is experimental characterization are manually annotated to Swiss-Prot standard. The annotation of these proteins is combined to build the rule containing the annotation to be propagated. Decisions are made regarding what annotation can be safely propagated to orthologs. The use of "cases" (for example: restricting the propagation of the annotation to a taxonomic group, or making it dependent on the detection of a certain conserved active-site amino acid residue; see examples below) helps to limit the extent of the propagation when further characterization is lacking and it is not safe to assume that the same function, subunit, cofactor, etc. apply to all members of a protein family. For more details see our "Standard operating procedure (SOP) for HAMAP annotation rule creation" document.
Accessing HAMAP
On the web
The most efficient and user-friendly way to access HAMAP data is through the Expasy server at https://hamap.expasy.org, where you can interactively browse, search and view the HAMAP profiles and annotation rules.
The view of each profile contains:
- General information about the family profile: accession number, name, taxonomic range of the family profile, and the associated rule(s) that are used to generate annotations for matching protein sequences. Additionally, the seed alignment used to generate the profile as well as the profile itself can be viewed here.
- Statistics about profile matches in UniProtKB, including the number of hits in UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, respectively, as well as the taxonomic distribution of the matches and a graphical view of the score distribution of the individual matches.
For more details, please consult the User Manual for the Web View, which is also available by clicking the headers of each section of a HAMAP profile page.
The view of each annotation rule contains:
- General information about the annotation rule: accession number, dates of creation and last revision, name and taxonomic scope of the rule, a list of characterized UniProtKB/Swiss-Prot template entries, and the family profile(s) used to detect family member sequences.
- Annotation that is propagated to member entries (e.g. protein name, gene name, comments, sequence features), including the conditions (taxonomic, metabolic or feature-based) that are used to control propagation.
- Keywords and Gene Ontology terms.
- Cross-references to protein-domain related databases (currently PROSITE, Pfam, PRINTS, TIGRFAMs and/or PIRSF).
- Characteristics of the family (e.g. size range, fusions, duplications, etc.).
- Comments about the family.
For more details, please consult the User Manual for the Web View, which is also available by clicking the headers of each section of a HAMAP rule page.
Download
Annotation rules, profiles and alignments are also available for download from the HAMAP FTP section at https://ftp.expasy.org/databases/hamap.
How to use
A relational database that supports incremental updates has been developed to store annotation rules, profiles, sequences and hits. In brief, protein sequences are scanned against HAMAP family profiles to determine family membership. For each positive match, annotations are generated by applying the corresponding HAMAP annotation rule and resolving its conditional statements.
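The flow described above (profile match, then rule application with conditions resolved) can be sketched roughly as follows. This is an illustration only, not the HAMAP implementation: the class names, the rule identifier MF_99999, the accessions and the annotation values are all made up, and the profile scan itself is replaced by a plain list of matched identifiers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Protein:
    accession: str
    lineage: List[str]   # taxonomic lineage, broadest group first
    sequence: str


@dataclass
class Case:
    # A condition and the annotations that may be propagated when it holds.
    condition: Callable[[Protein], bool]
    annotations: Dict[str, str]


@dataclass
class Rule:
    rule_id: str
    cases: List[Case] = field(default_factory=list)


def apply_rule(rule: Rule, protein: Protein) -> Dict[str, str]:
    # Resolve the conditional statements: keep only annotations whose case holds.
    annotations: Dict[str, str] = {}
    for case in rule.cases:
        if case.condition(protein):
            annotations.update(case.annotations)
    return annotations


def annotate(protein: Protein,
             matched_profile_ids: List[str],
             rules_by_profile: Dict[str, Rule]) -> Dict[str, Dict[str, str]]:
    # matched_profile_ids stands in for the result of a real profile scan.
    return {
        pid: apply_rule(rules_by_profile[pid], protein)
        for pid in matched_profile_ids
        if pid in rules_by_profile
    }


# Hypothetical rule: one unconditional annotation, one restricted to Bacteria,
# and one that depends on a conserved residue (1-based position 3 must be 'C').
example_rule = Rule(
    rule_id="MF_99999",
    cases=[
        Case(lambda p: True, {"protein_name": "Example protein"}),
        Case(lambda p: "Bacteria" in p.lineage, {"subunit": "Homodimer"}),
        Case(lambda p: len(p.sequence) >= 3 and p.sequence[2] == "C",
             {"active_site": "3"}),
    ],
)

bacterial = Protein("P_TEST1", ["Bacteria"], "MKCLV")
archaeal = Protein("P_TEST2", ["Archaea"], "MKALV")

print(annotate(bacterial, ["MF_99999"], {"MF_99999": example_rule}))
print(annotate(archaeal, ["MF_99999"], {"MF_99999": example_rule}))
```

In this sketch the archaeal sequence receives only the unconditional annotation, which mirrors how cases limit propagation when it is not safe to assume that an annotation applies to every family member.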
HAMAP-Scan
HAMAP can be used for the annotation of protein sequences via our web interface HAMAP-Scan, which accepts individual protein sequences or complete (microbial) proteomes for annotation. Please consult our "HAMAP-Scan User Manual" for further information on how to use this service.
HAMAP in UniProtKB
As part of the UniProt automatic annotation pipeline, HAMAP routinely provides annotations of Swiss-Prot quality for millions of unreviewed protein sequences in UniProtKB/TrEMBL. The HAMAP automatic pipeline annotates UniProtKB as follows: UniProtKB/TrEMBL sequences that match one of the HAMAP profiles are annotated using the associated annotation rule. Many checks are performed to prevent the propagation of wrong annotation, and any problematic cases are filtered out. The resulting annotations are integrated into UniProtKB/TrEMBL.
Automatically annotated entries present these general features:
- A cross-reference to a HAMAP family profile it matches
Cross-references are present in all UniProtKB entries that are matched by a HAMAP family profile (or several). These cross-references are found in the "Family & Domains" section of the entry, and have the following format:
HAMAP; profile identifier; profile name; count.
The identifiers are:
  - profile identifier: unique identifier for a HAMAP family profile
  - profile name: name of the HAMAP family profile
  - count: number of domains found in the protein, generally '1', occasionally '2' for the fusion of 2 identical domains.
Example: HAMAP; MF_01885; tRNA_methyltr_TrmL; 1.
- The source of annotations that are inferred by HAMAP is indicated with an evidence tag pointing to the HAMAP annotation rule that generated the annotation
Evidence tags on UniProtKB annotations sourced from a HAMAP annotation rule are of type {ECO:0000256|HAMAP-Rule: Rule-identifier}.
Example: Q8ZL60.
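For readers who process UniProtKB entries programmatically, the cross-reference format and the evidence tag described above could be recognized along these lines. This is a minimal sketch that assumes the text appears exactly in the formats shown in this section; the function names are hypothetical, and the rule identifier in the evidence-tag example reuses MF_01885 purely for illustration.

```python
import re
from typing import List, NamedTuple, Optional


class HamapXref(NamedTuple):
    profile_id: str
    profile_name: str
    count: int


def parse_hamap_xref(line: str) -> Optional[HamapXref]:
    """Parse 'HAMAP; <profile identifier>; <profile name>; <count>.'.

    Returns None for anything that does not have this shape.
    """
    parts = [p.strip() for p in line.strip().rstrip(".").split(";")]
    if len(parts) != 4 or parts[0] != "HAMAP":
        return None
    try:
        count = int(parts[3])
    except ValueError:
        return None
    return HamapXref(parts[1], parts[2], count)


# Matches the rule identifier inside an ECO:0000256|HAMAP-Rule evidence tag.
# Optional whitespace after the colon is tolerated.
HAMAP_EVIDENCE_RE = re.compile(r"ECO:0000256\|HAMAP-Rule:\s*([A-Za-z0-9_]+)")


def hamap_rule_ids(text: str) -> List[str]:
    """Return every HAMAP rule identifier cited in evidence tags in text."""
    return HAMAP_EVIDENCE_RE.findall(text)


xref = parse_hamap_xref("HAMAP; MF_01885; tRNA_methyltr_TrmL; 1.")
print(xref)

# Illustrative annotation text; the identifier is an assumption, not taken
# from a specific entry.
print(hamap_rule_ids("Some annotation. {ECO:0000256|HAMAP-Rule:MF_01885}"))
```

A count of 2 in the parsed cross-reference would correspond to the fusion of 2 identical domains mentioned above.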
Frequently asked questions (FAQ)
What is the coverage of HAMAP in a genome?
Because family rules have been built with a bias toward well-studied phyla and housekeeping genes, coverage depends on the organism type and the genome size. It ranges from up to 69% for small genomes such as Buchnera aphidicola (subsp. Acyrthosiphon pisum), through 26% for the model organism Escherichia coli K12, down to a bit less than 6% for the large genomes of some Streptomyces species (e.g. Streptomyces bingchenggensis).
Is it possible to annotate all the proteins of a new complete genome using HAMAP?
No. In certain new genomes, just over half of the proteins can be annotated automatically with the current set of rules. This coverage is continually expanded as new rules are added. However, the current approach is intrinsically limited to 'well-behaved' orthologous families.