What is HAMAP?





General

HAMAP stands for High-quality Automated and Manual Annotation of Proteins.

Next-generation sequencing and the ever-increasing rate of complete genome sequencing now produce so much data that it is no longer possible to manually annotate even a small portion of these genomes, despite the considerable demand for corrected and annotated complete proteome sets. To enrich the annotation of these proteomes in UniProtKB, we developed HAMAP, whose goal is to automatically annotate a significant percentage of the large number of proteins originating from complete genome sequencing projects. This automatic annotation pipeline, based on a collection of family profiles and manually created annotation rules, is only applied where it can produce the same quality as manual annotation, that is, for proteins that belong to well-defined families or subfamilies. By this we mean protein families that have a well-defined function and are well conserved at the sequence level.

The criteria for assigning initial membership to a family are sequence similarity and what is known in the literature about the protein in question. The "seed members" are manually chosen and aligned. This "seed alignment" is then used to automatically generate a HAMAP family profile (for more details, see "Automated annotation of microbial proteomes in Swiss-Prot", Comput. Biol. Chem. 27:49-58 (2003)).

A somewhat different approach is sometimes needed for the annotation of large and complex paralogous families (e.g. ABC transporters). For these, stringent profiles are required to distinguish between functional subfamilies according to the transported substrate. For ABC transporters, manually built PROSITE profiles are used to assign family membership and there are no seed alignments.

To create each HAMAP annotation rule, the available literature is consulted and all proteins for which there is experimental characterization are manually annotated to Swiss-Prot standard. These proteins are called "templates". Decisions are then made regarding which annotation can be safely propagated to orthologs. The use of "cases" (for example: restricting the propagation of an annotation to a taxonomic group, or making it dependent on the detection of a certain conserved active-site amino acid residue; see examples below) limits the extent of propagation when further characterization is lacking and it is not safe to assume that the same function, subunit, cofactor, etc. applies to all members of a protein family.

The HAMAP automatic pipeline is then used to annotate UniProtKB, in the following way: UniProtKB/TrEMBL sequences that match one of the HAMAP profiles are annotated using the associated annotation rule. Many checks are performed in order to prevent the propagation of wrong annotation, and any problematic cases are filtered out. The results of this annotation are integrated into UniProtKB/TrEMBL.
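
As an illustration of how a rule with "cases" can restrict propagation, here is a minimal Python sketch. It is not the HAMAP implementation: the class names (Protein, Case, Rule), the helper functions, the rule identifier and the annotation values are all hypothetical, and the profile match is reduced to a boolean supplied by the caller. It only shows the general idea that some annotation is applied to every matched sequence, while other annotation is applied only when a condition (taxonomic group, conserved residue) holds.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Protein:
    accession: str
    sequence: str
    lineage: tuple[str, ...]  # taxonomic lineage, e.g. ("Bacteria", "Proteobacteria")


@dataclass
class Case:
    """Annotation that is only propagated when `condition` holds (hypothetical model)."""
    condition: Callable[[Protein], bool]
    annotation: dict[str, str]


@dataclass
class Rule:
    rule_id: str
    common: dict[str, str]  # annotation considered safe for every family member
    cases: list[Case] = field(default_factory=list)


def in_taxon(taxon: str) -> Callable[[Protein], bool]:
    """Condition: the protein comes from the given taxonomic group."""
    return lambda p: taxon in p.lineage


def has_residue(position: int, residue: str) -> Callable[[Protein], bool]:
    """Condition: the residue at a 1-based sequence position is the expected one."""
    return lambda p: len(p.sequence) >= position and p.sequence[position - 1] == residue


def annotate(protein: Protein, rule: Rule, profile_matches: bool) -> Optional[dict[str, str]]:
    """Return the annotation to propagate, or None if the profile did not match."""
    if not profile_matches:
        return None
    result = dict(rule.common)
    for case in rule.cases:
        if case.condition(protein):
            result.update(case.annotation)
    return result


# Hypothetical rule: identifier and annotation values are placeholders.
example_rule = Rule(
    rule_id="MF_XXXXX",
    common={"protein_name": "Example enzyme"},
    cases=[
        Case(in_taxon("Bacteria"), {"subunit": "Homodimer"}),
        Case(has_residue(3, "C"), {"active_site": "Cys-3"}),
    ],
)

if __name__ == "__main__":
    bacterial = Protein("TEST1", "MKCLL", ("Bacteria", "Proteobacteria"))
    archaeal = Protein("TEST2", "MKALL", ("Archaea",))
    print(annotate(bacterial, example_rule, profile_matches=True))
    print(annotate(archaeal, example_rule, profile_matches=True))
```

In this sketch the archaeal sequence receives only the common annotation, because neither case condition is satisfied. The real pipeline additionally performs many checks and filters problematic cases, which are not modelled here.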

A relational database that supports incremental updates has been developed to store annotation rules, profiles, sequences and hits.

HAMAP content

The view of each profile contains:
The view of each annotation rule contains:
For more details, please consult the User Manual for the Web View. It is also available by clicking the headers of each section of a HAMAP rule page.

Technical aspects

HAMAP and how it is updated

See the HAMAP rules page for up-to-date statistics about the number and taxonomic coverage of HAMAP rules and profiles. See the proteomes page for up-to-date statistics about the number of complete prokaryotic proteomes currently available in UniProtKB and the coverage of HAMAP profiles in each proteome. The HAMAP release is concurrent with every release of UniProtKB, which takes place every 4 weeks. New families are added in each release, and existing families are periodically updated.

How to tell if a UniProtKB entry has been automatically annotated

Automatically annotated entries present these general features:

An example: A8A671.

Cross-references from UniProtKB to HAMAP

Cross-references are present in all UniProtKB entries that are matched by a HAMAP family profile (or several). These cross-references are found in the "Cross-references/Family and domain databases" section of the entry, and have the following format:
HAMAP; profile-identifier; profile-name; count; status.

The identifiers are:
Profile-identifier: Unique identifier for a HAMAP family profile
Profile-name: Name of the HAMAP family profile
Count: Number of domains found in the protein, generally '1', occasionally '2' for the fusion of 2 identical domains.
Status: One of '-', 'fused', 'atypical' or 'atypical/fused'. The value '-' is a placeholder for an empty field; 'fused' indicates that the family rule does not cover the entire protein; 'atypical' indicates that the protein is divergent in sequence or has mutated functional sites, and should not be included in family datasets; 'atypical/fused' indicates that both of these conditions apply.

Example: HAMAP; MF_01885; tRNA_methyltr_TrmL; 1; -.
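
The cross-reference format described above is simple enough to parse with a few lines of code. The following Python sketch splits such a line into its four fields and validates the status value against the four values listed above. The names HamapXref and parse_hamap_xref are hypothetical, chosen for illustration; they are not part of any UniProtKB or HAMAP tool.

```python
from dataclasses import dataclass

# The four status values described in the text above.
VALID_STATUSES = {"-", "fused", "atypical", "atypical/fused"}


@dataclass(frozen=True)
class HamapXref:
    profile_id: str
    profile_name: str
    count: int
    status: str

    @property
    def is_fused(self) -> bool:
        return "fused" in self.status.split("/")

    @property
    def is_atypical(self) -> bool:
        return "atypical" in self.status.split("/")


def parse_hamap_xref(line: str) -> HamapXref:
    """Parse 'HAMAP; profile-identifier; profile-name; count; status.'"""
    text = line.strip()
    if text.endswith("."):
        text = text[:-1]  # drop the terminating full stop
    fields = [part.strip() for part in text.split(";")]
    if len(fields) != 5 or fields[0] != "HAMAP":
        raise ValueError(f"not a HAMAP cross-reference: {line!r}")
    _, profile_id, profile_name, count, status = fields
    if status not in VALID_STATUSES:
        raise ValueError(f"unexpected status {status!r} in: {line!r}")
    return HamapXref(profile_id, profile_name, int(count), status)


if __name__ == "__main__":
    xref = parse_hamap_xref("HAMAP; MF_01885; tRNA_methyltr_TrmL; 1; -.")
    print(xref.profile_id, xref.profile_name, xref.count, xref.is_atypical)
```

A consumer building a family dataset could, for instance, skip every entry for which is_atypical is true, in line with the recommendation above.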

Feature propagation

Protein features (and associated comments and keywords) are propagated automatically using two different approaches.

HAMAP profiles

The HAMAP profiles that are used to identify family members are generated from a manually produced seed alignment using an automatic procedure based on the method used to generate PROSITE profiles (see Sigrist et al., Brief. Bioinform. 3(3):265-274 (2002)).

Accessing HAMAP

The most efficient and user-friendly way to access HAMAP data is to browse the ExPASy server interactively at http://hamap.expasy.org.

Downloading HAMAP data

Annotation rules, profiles and alignments are available from the HAMAP FTP section at ftp://ftp.expasy.org/databases/hamap.

Frequently asked questions (FAQ)

Will HAMAP be extended to eukaryotes?

Yes. HAMAP was originally developed for the annotation of proteins from completely sequenced bacteria, archaea and plastids, but we have now also started to produce and integrate HAMAP rules and profiles targeting eukaryotic protein families.

What is the coverage of HAMAP in a genome?

Since family rules have been built with a bias toward well-studied phyla and housekeeping genes, the coverage depends on the organism type and the genome size. It ranges from up to 64% for small genomes such as Buchnera aphidicola (subsp. Acyrthosiphon pisum), through 25% for the model organism Escherichia coli K12, down to less than 6% for the large genomes of some Streptomyces species (e.g. Streptomyces bingchenggensis).

Is it possible to annotate all the proteins of a new complete genome using HAMAP?

In certain new genomes, it is possible to annotate just over half of the proteins automatically with the current set of rules. This coverage is constantly expanded with the addition of new rules. However, the current approach is intrinsically limited to 'well-behaved' orthologous families.

Why and when do we merge genomes in UniProtKB?

Historically, when a new genome arrived, we merged it with all preexisting UniProt entries of that bacterium or archaeon, regardless of the strain of the new genome. As more and more strains of the same organism were sequenced, it became evident that this is no longer appropriate or desirable. We now usually assign a new species code to each new strain of a particular organism, and then merge the new entries with any entries of the same strain that already exist in UniProtKB/Swiss-Prot. At present we only merge complete microbial genomes when they are from the same strain.