What is HAMAP?
General
HAMAP stands for High-quality Automated and Manual Annotation of Proteins.
Next-generation sequencing and the ever-increasing rate of complete genome sequencing now produce so much data
that it is no longer possible to manually annotate even a small portion of these genomes, despite the
considerable demand for corrected and annotated complete proteome sets. To enrich the annotation of these
proteomes in UniProtKB, we developed HAMAP, whose goal is to automatically annotate a significant percentage
of the large number of proteins originating from complete genome sequencing projects. This automatic annotation
pipeline, based on a collection of family profiles and manually created annotation rules, is applied only in
cases where it can produce the same quality as manual annotation would, that is, for proteins that are part of
well-defined families or subfamilies.
By this we mean protein families which have a well-defined function and which are well conserved
at the sequence level.
The criteria to assign initial membership to a family are sequence similarity and what is known in the
literature about the protein in question. The "seed members" are manually chosen and aligned. This
"seed alignment" is then used to automatically generate a HAMAP family profile (for more details see
"Automated annotation of microbial proteomes in Swiss-Prot", Comput. Biol. Chem. 27:49-58 (2003)).
Sometimes a somewhat different approach is needed to annotate large and complex paralogous families
(e.g. ABC transporters). For these, stringent profiles are required to distinguish between functional
subfamilies according to the transported substrate. For ABC transporters, manually built PROSITE
profiles are used to assign family membership and there are no seed alignments.
To create each HAMAP annotation rule, the available literature is consulted and all proteins for which there
is experimental characterization are manually annotated to Swiss-Prot standard. These proteins are called
"templates". Decisions are then made about which annotation can be safely propagated to orthologs. The use of
"cases" (for example: restricting the propagation of an annotation to a taxonomic group, making it depend on
the detection of a certain conserved active-site amino acid residue, etc.; see examples below) helps to limit
the extent of propagation when further characterization is lacking and it is not safe to assume that the same
function, subunit, cofactor, etc. apply to all members of a protein family.
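To make the idea of "cases" more concrete, the following is a minimal sketch of how conditional propagation might be modelled. It is not the actual HAMAP rule syntax or implementation: the `Case` and `Protein` classes, their field names and the example annotations are all hypothetical, and residue positions are checked directly on the target sequence, whereas a real pipeline would map them through an alignment.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """One conditional block of an annotation rule (hypothetical structure)."""
    annotations: dict                         # annotation to propagate, e.g. {"cofactor": "Zn(2+)"}
    taxa: frozenset = frozenset()             # empty set means no taxonomic restriction
    required_residues: dict = field(default_factory=dict)  # 1-based position -> residue


@dataclass
class Protein:
    sequence: str
    lineage: tuple                            # taxonomic lineage, e.g. ("Bacteria",)


def case_applies(case, protein):
    """Return True if every condition of the case holds for the protein."""
    # Taxonomic condition: at least one allowed taxon must be in the lineage.
    if case.taxa and not case.taxa.intersection(protein.lineage):
        return False
    # Feature-based condition: each required residue must be present.
    for pos, residue in case.required_residues.items():
        if pos > len(protein.sequence) or protein.sequence[pos - 1] != residue:
            return False
    return True


def propagate(cases, protein):
    """Collect the annotations of all cases whose conditions are satisfied."""
    result = {}
    for case in cases:
        if case_applies(case, protein):
            result.update(case.annotations)
    return result
```

In this sketch, an unconditional case carries the annotation that is safe for the whole family, while restricted cases add annotation only where their conditions are met.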
The HAMAP automatic pipeline is then used to annotate UniProtKB, in the following way: UniProtKB/TrEMBL
sequences that match one of the HAMAP profiles are annotated using the associated annotation rule.
Many checks are performed in order to prevent the propagation of wrong annotation, and any problematic cases are
filtered out. The results of this annotation are integrated into UniProtKB/TrEMBL.
A relational database that supports incremental updates has been developed to store annotation rules, profiles,
sequences and hits.
HAMAP content
The view of each profile contains:
- General information about the family profile: accession number, name, taxonomic range of the
family profile, and the associated rule(s) that are used to generate annotations for matching
protein sequences. Additionally, the seed alignment used to generate the profile as well as the
profile itself can be viewed here.
- Statistics about profile matches in UniProtKB, including the number of hits in UniProtKB/Swiss-Prot
and UniProtKB/TrEMBL, respectively, as well as the taxonomic distribution of the matches and a
graphical view of the score distribution of the individual matches. It is also possible
to retrieve all matches of a HAMAP profile in UniProtKB directly from this page.
Example: HAMAP profile MF_01962
The view of each annotation rule contains:
- General information about the annotation rule: accession number, dates of creation and last revision,
the family profile(s) used to detect family member sequences, name and taxonomic scope of
the rule.
- Annotation that is propagated to member entries (e.g. protein name, gene name, comments),
including the conditions (taxonomic, metabolic or feature-based) that are used to control propagation.
- Keywords and Gene Ontology terms.
- Cross-references to protein-domain related databases (currently PROSITE, Pfam, PRINTS, TIGRFAMs
and/or PIRSF)
- Computed features (e.g. export signals, transmembrane regions) that may be applied to
entries by using appropriate prediction programs (see below, "Technical aspects, Computed features").
- Propagated features, such as metal binding sites, active sites, etc. These are annotated based
on the presence of conserved amino acids, as determined by aligning potential new members against a
multiple alignment (the seed alignment) that includes a characterized template entry (see below,
"Technical aspects, Propagated features").
- Characteristics of the family (e.g. size range, fusions, duplications, etc.).
In this section, the entry that was used as a template is also indicated; there can be more
than one template.
- Comments about the family.
Example: HAMAP rule MF_01962
For more details, please consult the User Manual for the Web View. The manual is also available by clicking
the header of each section of a HAMAP rule page.
Technical aspects
HAMAP and how it is updated
See the HAMAP rules page for up-to-date statistics about the number and taxonomic coverage of HAMAP rules
and profiles.
See the proteomes page for up-to-date statistics about the number of complete prokaryotic proteomes currently
available in UniProtKB and the coverage of HAMAP profiles in each proteome.
HAMAP is released concurrently with each release of UniProtKB, which takes place every 4 weeks.
New families are added in each release, and existing families are periodically updated.
How to tell if a UniProtKB entry has been automatically annotated
Automatically annotated entries have the following general features:
- A cross-reference to the HAMAP family profile that the entry matches (see following section).
- Features and comments that are inferred by automatic methods are marked with the
qualifiers "By similarity" or "Potential" (see the UniProt document Non-experimental qualifiers), and
the source of the annotation is indicated at the end of the feature/comment
by an evidence tag pointing to the HAMAP annotation rule that generated the annotation.
Example: A8A671.
Cross-references from UniProtKB to HAMAP
Cross-references are present in all UniProtKB entries that are matched by one or more HAMAP family profiles.
These cross-references are found in the "Cross-references/Family and domain databases" section of
the entry, and have the following format:
HAMAP; profile-identifier; profile-name; count; status.
The fields are:
- Profile-identifier: Unique identifier for a HAMAP family profile.
- Profile-name: Name of the HAMAP family profile.
- Count: Number of domains found in the protein, generally '1', occasionally '2' for the
fusion of 2 identical domains.
- Status: The values are either '-', 'fused', 'atypical' or 'atypical/fused'.
The value '-' is a placeholder for an empty field; the value 'fused'
indicates that the family rule does not cover the entire protein; the value
'atypical' indicates that the protein is divergent in sequence or has mutated functional
sites, and should not be included in family datasets; the value 'atypical/fused' indicates
that both of these conditions apply.
Example: HAMAP; MF_01885; tRNA_methyltr_TrmL; 1; -.
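The cross-reference format described above is simple enough to parse with a few lines of code. The following is a minimal sketch, assuming the line is supplied exactly in the documented form (semicolon-separated fields with a trailing period); the `HamapXref` type and `parse_hamap_xref` function are hypothetical names introduced here for illustration and are not part of any HAMAP or UniProt tool.

```python
from typing import NamedTuple

VALID_STATUS = {"-", "fused", "atypical", "atypical/fused"}


class HamapXref(NamedTuple):
    profile_id: str
    profile_name: str
    count: int        # number of domains found in the protein
    atypical: bool    # divergent sequence or mutated functional sites
    fused: bool       # the family rule does not cover the entire protein


def parse_hamap_xref(line):
    """Parse 'HAMAP; profile-identifier; profile-name; count; status.'"""
    # Drop surrounding whitespace and the trailing period, then split on ';'.
    fields = [f.strip() for f in line.strip().rstrip(".").split(";")]
    if len(fields) != 5 or fields[0] != "HAMAP":
        raise ValueError(f"not a HAMAP cross-reference: {line!r}")
    _, profile_id, profile_name, count, status = fields
    if status not in VALID_STATUS:
        raise ValueError(f"unexpected status value: {status!r}")
    # '-' is a placeholder for an empty field; otherwise split the combined value.
    flags = set() if status == "-" else set(status.split("/"))
    return HamapXref(profile_id, profile_name, int(count),
                     "atypical" in flags, "fused" in flags)


xref = parse_hamap_xref("HAMAP; MF_01885; tRNA_methyltr_TrmL; 1; -.")
print(xref.profile_id, xref.count, xref.fused)  # prints: MF_01885 1 False
```

Splitting the status into two booleans makes it easy to exclude 'atypical' entries when building family datasets, as recommended above.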
Feature propagation
Protein features (and associated comments and keywords) are propagated automatically using two different approaches.
- Propagated features are transferred on the basis of their conservation throughout the family. The seed
alignment(s) of the family, which contain representative, manually annotated family entries, are used to
transfer features from the annotation rule to family members, provided that the conserved residues indicated
in the rule are found at the corresponding positions.
- Computed features are predicted using ad hoc prediction programs (e.g. for export signals and
transmembrane regions).
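The first approach (propagated features) can be illustrated with a small sketch. It assumes that the new family member has already been aligned against the seed alignment, so that the template row and the query row are gapped strings of equal length; the function names, the gap character and the toy sequences are assumptions made for this example and do not reflect the actual HAMAP code.

```python
GAP = "-"


def column_of(aligned_seq, position):
    """Return the 0-based alignment column of a 1-based ungapped position."""
    seen = 0
    for col, ch in enumerate(aligned_seq):
        if ch != GAP:
            seen += 1
            if seen == position:
                return col
    raise ValueError(f"position {position} is beyond the end of the sequence")


def position_at(aligned_seq, col):
    """Return the 1-based ungapped position at a column, or None for a gap."""
    if aligned_seq[col] == GAP:
        return None
    return sum(1 for ch in aligned_seq[: col + 1] if ch != GAP)


def propagate_feature(template_aln, query_aln, template_pos, allowed):
    """Map a template feature position onto the query sequence.

    Returns the 1-based query position if the aligned query residue is one
    of the allowed (conserved) residues, otherwise None (no propagation).
    """
    if len(template_aln) != len(query_aln):
        raise ValueError("aligned rows must have the same length")
    col = column_of(template_aln, template_pos)
    query_pos = position_at(query_aln, col)
    if query_pos is None or query_aln[col] not in allowed:
        return None
    return query_pos


# Toy alignment: the template has a cysteine at position 3 (e.g. an active site).
template = "MK-CHA"
query = "-KACHG"
print(propagate_feature(template, query, 3, {"C"}))  # prints: 3
```

Returning `None` when the residue is absent or not conserved mirrors the rule that a feature is only transferred if the expected residue is found at the corresponding position.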
HAMAP profiles
The HAMAP profiles that are used to identify family members are generated from a manually produced seed
alignment using an automatic procedure based on the method used to generate PROSITE profiles
(see Sigrist et al., Brief. Bioinform. 3(3):265-274 (2002)).
Accessing HAMAP
The most efficient and user-friendly way to access HAMAP data is to browse the ExPASy server interactively
at http://hamap.expasy.org.
Downloading HAMAP data
Annotation rules, profiles and alignments are available from the HAMAP FTP section at
ftp://ftp.expasy.org/databases/hamap.
Frequently asked questions (FAQ)
Will HAMAP be extended to eukaryotes?
Yes. HAMAP was originally developed for the annotation of proteins from completely sequenced
bacteria, archaea and plastids, but we have now also started to produce and integrate HAMAP rules and
profiles targeting eukaryotic protein families.
What is the coverage of HAMAP in a genome?
Since family rules have been built with a bias toward well-studied phyla and housekeeping genes, the
coverage depends on the organism type and the genome size. It ranges from up to 64% for small genomes
such as Buchnera aphidicola (subsp. Acyrthosiphon pisum), to 25% for the model organism
Escherichia coli K12, to less than 6% for the large genomes of some Streptomyces species
(e.g. Streptomyces bingchenggensis).
Is it possible to annotate all the proteins of a new complete genome using HAMAP?
In certain new genomes, it is possible to annotate just over half of the proteins automatically with the
current set of rules. This coverage is constantly expanded with the addition of new rules.
However, the current approach is intrinsically limited to 'well-behaved' orthologous families.
Why and when do we merge genomes in UniProtKB?
Historically, when a new genome arrived, we merged it with all pre-existing UniProt entries of that
bacterium or archaeon, regardless of the strain of the new genome. As more and more strains of the same
organism were sequenced, it became evident that this is no longer appropriate or desirable. We therefore now
usually assign a new species code to each new strain of a particular organism, and then merge the new entries
with any entries of the same strain that already exist in UniProtKB/Swiss-Prot. At present, we only
merge complete microbial genomes when they are from the same strain.