What is HAMAP?
HAMAP stands for High-quality Automated and Manual Annotation of Proteins.
Next-generation sequencing and the ever-increasing rate of complete genome sequencing now produce so much data that it is no longer possible to manually annotate even a small portion of these genomes, despite the considerable demand for corrected and annotated complete proteome sets. To enrich their annotation in UniProtKB, we developed HAMAP, whose goal is to automatically annotate a significant percentage of the large number of proteins originating from complete genome sequencing projects.
This automatic annotation pipeline, based on a collection of family profiles and manually created annotation rules, is only applied in cases where it can produce the same quality as manual annotation would, that is for proteins that are part of well-defined families or subfamilies. By this
we mean protein families which have a well-defined function and which are well conserved at the sequence level.
The criteria to assign initial membership to a family are sequence similarity and what is known in the literature about the protein in question. The "seed members" are manually chosen and aligned. This "seed alignment" is then used to automatically generate a HAMAP family profile (for more details see "Automated annotation of microbial proteomes in Swiss-Prot", Comput. Biol. Chem. 27:49-58 (2003)).
Sometimes, we need to use a somewhat different approach for the annotation of large and complex paralogous families (e.g. ABC transporters). For these, stringent profiles are required to distinguish between functional subfamilies regarding the transported substrate. For ABC transporters, manually built PROSITE profiles are used to assign family membership and there are no seed alignments.
To create each HAMAP annotation rule, the available literature is consulted and all proteins for which there is experimental characterization are manually annotated to Swiss-Prot standard. These proteins are called "templates". Decisions are then made regarding what annotation can be safely propagated to orthologs. The use of "cases" (for example: restricting the propagation of an annotation to a taxonomic group, or making it depend on the detection of a certain conserved active-site amino acid residue; see examples below) limits the extent of the propagation when further characterization is lacking and it is not safe to assume that the same function, subunit, cofactor, etc. apply to all members of a protein family.
The HAMAP automatic pipeline is then used to annotate UniProtKB, in the following way: UniProtKB/TrEMBL sequences that match one of the HAMAP profiles are annotated using the associated annotation rule. Many checks are performed in order to prevent the propagation of wrong annotation, and
any problematic cases are filtered out. The results of this annotation are integrated into UniProtKB/TrEMBL.
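The way a rule and its "cases" control propagation can be illustrated with a minimal Python sketch. This is not the HAMAP pipeline itself: the class names, the function apply_rules, the annotation keys and the rule accession MF_99999 are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Protein:
    accession: str
    lineage: List[str]            # taxonomic lineage, e.g. ["Bacteria", ...]
    matched_profiles: List[str]   # accessions of the profiles this sequence matches


@dataclass
class Case:
    # A condition that must hold before the extra annotations are propagated.
    condition: Callable[[Protein], bool]
    annotations: Dict[str, str]


@dataclass
class Rule:
    rule_id: str
    profile_id: str
    annotations: Dict[str, str]                 # propagated to every family member
    cases: List[Case] = field(default_factory=list)


def apply_rules(protein: Protein, rules: List[Rule]) -> Dict[str, str]:
    """Collect the annotations of every rule whose profile the protein matches."""
    propagated: Dict[str, str] = {}
    for rule in rules:
        if rule.profile_id not in protein.matched_profiles:
            continue  # no profile match, so nothing is propagated
        propagated.update(rule.annotations)
        for case in rule.cases:
            if case.condition(protein):
                propagated.update(case.annotations)
    return propagated


# Hypothetical rule: the subunit comment is only propagated within Bacteria.
example_rule = Rule(
    rule_id="MF_99999",
    profile_id="MF_99999",
    annotations={"protein_name": "Example enzyme"},
    cases=[Case(lambda p: "Bacteria" in p.lineage, {"subunit": "Homodimer"})],
)
```

In this sketch an archaeal protein matching the profile would receive the protein name but not the subunit comment, which mirrors the idea of a taxonomic restriction on propagation.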
A relational database that supports incremental updates has been developed to store annotation rules, profiles, sequences and hits.
The view of each profile contains:
- General information about the family profile: accession number, name, taxonomic range of the family profile, and the associated rule(s) that are used to generate annotations for matching protein sequences. Additionally, the seed alignment used to generate the profile as well as the
profile itself can be viewed here.
- Statistics about profile matches in UniProtKB, including the number of hits in UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, respectively, as well as the taxonomic distribution of the matches and a graphical view of the score distribution of the individual matches. It is also possible
to retrieve all matches of a HAMAP profile in UniProtKB directly from this page.
Example: HAMAP profile MF_01962
For more details, please consult the User Manual for the Web View. It is also available by clicking the headers of each section of a HAMAP profile page.
The view of each annotation rule contains:
- General information about the annotation rule: accession number, dates of creation and last revision, the family profile(s) used to detect family member sequences, name and taxonomic scope of the rule.
- Annotation that is propagated to member entries (e.g. protein name, gene name, comments), including the conditions (taxonomic, metabolic or feature-based) that are used to control propagation.
- Keywords and Gene Ontology terms.
- Cross-references to protein-domain related databases (currently PROSITE, Pfam, PRINTS, TIGRFAMs and/or PIRSF).
- Computed features (e.g. export signals, transmembrane regions) that may be applied to entries by using appropriate prediction programs (see below, "Technical aspects, Computed features").
- Propagated features, such as metal binding sites, active sites, etc. These are annotated based on the presence of conserved amino acids, as determined by aligning potential new members against a multiple alignment (the seed alignment) that includes a characterized template entry
(see below, "Technical aspects, Propagated features").
- Characteristics of the family (e.g. size range, fusions, duplications, etc.). In this section, the entry that was used as a template is also indicated; there can be more than one template.
- Comments about the family.
Example: HAMAP rule MF_01962
For more details, please consult the User Manual for the Web View. It is also available by clicking the headers of each section of a HAMAP rule page.
HAMAP and how it is updated
Find up-to-date statistics about the number of HAMAP rules and profiles on the HAMAP website. See the proteomes page for statistics about the number of complete prokaryotic proteomes currently available in UniProtKB and the coverage of HAMAP profiles in each proteome. The HAMAP release is concurrent with every release of UniProtKB, which takes place every 4 weeks. New families are added in each release, and existing families are periodically updated.
Cross-references from UniProtKB to HAMAP
Cross-references are present in all UniProtKB entries that are matched by a HAMAP family profile (or several). These cross-references are found in the "Cross-references/Family and domain databases" section of the entry, and have the following format:
HAMAP; profile-identifier; profile-name; count; status
The identifiers are:
- profile-identifier: Unique identifier for a HAMAP family profile.
- profile-name: Name of the HAMAP family profile.
- count: Number of domains found in the protein, generally '1', occasionally '2' for the fusion of 2 identical domains.
- status: The values are either '-', 'fused', 'atypical' or 'atypical/fused'. The value '-' is a placeholder for an empty field; the value 'fused' indicates that the family profile does not cover the entire protein; the value 'atypical' points out that the protein is divergent in sequence or has mutated functional sites, and should not be included in family datasets; the value 'atypical/fused' indicates that both of these findings apply.
Example: HAMAP; MF_01885; tRNA_methyltr_TrmL; 1; -.
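The format above is simple enough to parse with a few lines of code. The following Python sketch handles only the line layout shown in this section; the names HamapXref and parse_hamap_xref are illustrative and not part of any UniProt or HAMAP tool.

```python
from dataclasses import dataclass

# Status values listed in the field description above.
VALID_STATUS = {"-", "fused", "atypical", "atypical/fused"}


@dataclass(frozen=True)
class HamapXref:
    profile_id: str
    profile_name: str
    count: int
    status: str

    @property
    def is_fused(self) -> bool:
        return "fused" in self.status.split("/")

    @property
    def is_atypical(self) -> bool:
        return "atypical" in self.status.split("/")


def parse_hamap_xref(line: str) -> HamapXref:
    """Parse 'HAMAP; profile-identifier; profile-name; count; status.'"""
    text = line.strip()
    if text.endswith("."):
        text = text[:-1]  # drop the terminating full stop
    fields = [part.strip() for part in text.split(";")]
    if len(fields) != 5 or fields[0] != "HAMAP":
        raise ValueError(f"not a HAMAP cross-reference: {line!r}")
    _, profile_id, profile_name, count, status = fields
    if status not in VALID_STATUS:
        raise ValueError(f"unexpected status value: {status!r}")
    return HamapXref(profile_id, profile_name, int(count), status)


xref = parse_hamap_xref("HAMAP; MF_01885; tRNA_methyltr_TrmL; 1; -.")
print(xref.profile_id, xref.count, xref.is_fused, xref.is_atypical)
```

When building family datasets, entries for which is_atypical is true would be the ones to exclude, following the description of the 'atypical' status.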
How to tell if a UniProtKB entry has been automatically annotated
Automatically annotated entries present these general features:
- A cross-reference to the HAMAP family profile the entry matches (see the previous section, "Cross-references from UniProtKB to HAMAP"),
- Features and comments that are inferred by automatic methods are marked with the qualifiers "By similarity" or "Potential" (see the UniProt document "Non-experimental qualifiers"), and the source of the annotation is indicated at the end of the feature/comment with an evidence tag pointing to the HAMAP annotation rule that generated the annotation.
An example: A8A671
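As a rough illustration, the evidence tags can be searched for programmatically. The sketch below assumes that the tag appears in the entry text in the form "HAMAP-Rule:" followed by a rule accession of the form MF_ plus five digits; that textual form is an assumption not stated in this document, and the function name hamap_rules_cited is hypothetical.

```python
import re
from typing import Set

# Assumed tag layout: "HAMAP-Rule:" followed by an accession such as MF_01885.
HAMAP_EVIDENCE = re.compile(r"HAMAP-Rule:(MF_\d{5})")


def hamap_rules_cited(entry_text: str) -> Set[str]:
    """Return the HAMAP rule accessions cited as evidence in an entry's text."""
    return set(HAMAP_EVIDENCE.findall(entry_text))
```

An empty result does not prove that an entry was manually annotated; it only means that no tag in the assumed form was found.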
Protein features (and associated comments and keywords) are propagated automatically using two different approaches.
- Propagated features are propagated on the basis of their conservation throughout the family. The seed alignment(s) of the family containing representative, manually annotated family entries is used to transfer features from the annotation rule to family members,
provided that conserved residues as indicated in the rule are found at the corresponding positions.
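The principle behind propagated features can be sketched as follows. The example assumes that the template and the new family member are available as two rows of the same alignment, with "-" as the gap character; the function names and the toy sequences are hypothetical, and the actual HAMAP implementation may differ.

```python
from typing import Optional


def column_of_position(aligned: str, position: int) -> Optional[int]:
    """Alignment column (0-based) of the residue at a 1-based sequence position."""
    seen = 0
    for column, char in enumerate(aligned):
        if char != "-":
            seen += 1
            if seen == position:
                return column
    return None  # the sequence is shorter than the requested position


def position_at_column(aligned: str, column: int) -> Optional[int]:
    """1-based sequence position of the residue in a column, or None for a gap."""
    if aligned[column] == "-":
        return None
    return sum(1 for char in aligned[: column + 1] if char != "-")


def propagate_feature(template_aln: str, target_aln: str,
                      template_pos: int, allowed_residues: str) -> Optional[int]:
    """Return the target position of a feature, or None if it is not conserved."""
    if len(template_aln) != len(target_aln):
        raise ValueError("both rows must come from the same alignment")
    column = column_of_position(template_aln, template_pos)
    if column is None:
        return None
    if target_aln[column] not in allowed_residues:
        return None  # gap or mutated residue: the feature is not propagated
    return position_at_column(target_aln, column)


# Toy alignment: an active-site Cys at template position 3.
print(propagate_feature("MKC-DE", "-ACGDE", 3, "C"))  # 2
print(propagate_feature("MKC-DE", "-ASGDE", 3, "C"))  # None
```

The second call returns None because the aligned residue is a serine rather than the required cysteine, so the feature would not be transferred to that family member.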
Computed features are predicted using the following ad hoc methods:
The HAMAP profiles that are used to identify family members are generated from a manually produced seed alignment using an automatic procedure based on the method used to generate PROSITE profiles (see Sigrist et al., Brief. Bioinform. 3(3):265-274 (2002)).
The most efficient and user-friendly way to access HAMAP data is to browse the ExPASy server interactively at http://hamap.expasy.org
Downloading HAMAP data
Annotation rules, profiles and alignments are available at the HAMAP ftp section ftp://ftp.expasy.org/databases/hamap
Frequently asked questions (FAQ)
Will HAMAP be extended to eukaryotes?
Yes. HAMAP was originally developed for the annotation of proteins from completely sequenced bacteria, archaea and plastids, but we have now also started to produce and integrate HAMAP rules and profiles targeting eukaryotic protein families.
What is the coverage of HAMAP in a genome?
Since family rules have been built with a bias toward well-studied phyla and housekeeping genes, the coverage depends on the organism type and the genome size. HAMAP coverage ranges from up to 66% of small genomes such as Buchnera aphidicola (subsp. Acyrthosiphon pisum), to 24% for the model organism Escherichia coli K12, to less than 6% for the large genomes of some Streptomyces species (e.g. Streptomyces bingchenggensis).
Is it possible to annotate all the proteins of a new complete genome using HAMAP?
In certain new genomes, it is possible to annotate just over half of the proteins automatically with the current set of rules. This coverage is constantly expanded with the addition of new rules. However, the current approach is intrinsically limited to 'well-behaved' orthologous