HAMAP annotation rules: User Manual for the Web View
The HAMAP annotation rules are written in the UniRule format, which is used by the UniProt Knowledgebase (UniProtKB) automated annotation projects to annotate protein records in the UniProtKB format. The rules can be displayed in a user-friendly Web View, which consists of the following three main sections and their associated sub-sections.
General rule information
Accession
This section indicates the accession number of the rule, in the form MF_xxxxx.
Dates
This section is composed of two lines. The first line indicates the rule creation date; the second corresponds to the last rule revision date.
Name
This section provides the name of the protein.
Scope
This section indicates the taxonomic range covered by the rule.
Template(s)
This section lists the UniProtKB accession numbers of the entries from which the rule's annotation was inferred. The template entries are usually characterized. "Template: None" indicates that there are no characterization papers on any of the proteins that belong to that family. This is the case for UPFs (Uncharacterized Protein Families), for example.
Triggered by
This section indicates the profile identifier(s) used to trigger the application of the rule. The trigger can be either:
-
A HAMAP profile derived from the seed alignment of representative members of the family.
In this case the format is:
HAMAP; the profile identifier (e.g. HAMAP; MF_01322)
Clicking on the profile identifier displays the HAMAP profile page, which shows general information about the profile and statistics for profile matches in UniProtKB.
In most cases the HAMAP rule is triggered by only one profile. However, some rules have two profiles: one for bacteria (and plastids, if applicable) and another for archaea. In these cases, the profile identifier suffix is '_B' for bacteria (and plastids, if applicable) and '_A' for archaea (e.g. MF_00036).
-
A PROSITE profile
In this case the format is:
PROSITE; the profile identifier (e.g. PROSITE; PS51244)
Clicking on the profile identifier displays the PROSITE profile page, which shows general information about the profile and statistics for profile matches in UniProtKB (e.g. MF_01708).
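The two trigger formats described above can be read mechanically. The following is a minimal sketch, not part of HAMAP itself: the names Trigger and parse_trigger are hypothetical, and the sketch assumes that HAMAP identifiers consist of "MF_" followed by five digits, with the '_A' or '_B' suffix appended directly to the identifier (e.g. MF_00036_A). PROSITE identifiers are passed through without validation.

```python
import re
from typing import NamedTuple, Optional


class Trigger(NamedTuple):
    """Hypothetical container for one 'Triggered by' line."""
    database: str            # "HAMAP" or "PROSITE"
    identifier: str          # profile identifier as written in the rule
    kingdom: Optional[str]   # "archaea", "bacteria" or None


# Assumption: a trigger line is "<database>; <identifier>".
_TRIGGER_RE = re.compile(r"^(HAMAP|PROSITE);\s*(\S+)$")
# Assumption: HAMAP identifiers are MF_ plus five digits, optionally _A or _B.
_HAMAP_ID_RE = re.compile(r"^MF_\d{5}(?:_([AB]))?$")


def parse_trigger(line: str) -> Trigger:
    """Split a trigger line into database, identifier and kingdom suffix."""
    match = _TRIGGER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised trigger line: {line!r}")
    database, identifier = match.groups()
    kingdom = None
    if database == "HAMAP":
        id_match = _HAMAP_ID_RE.match(identifier)
        if id_match is None:
            raise ValueError(f"unexpected HAMAP identifier: {identifier!r}")
        # '_B' covers bacteria (and plastids, if applicable), '_A' archaea.
        kingdom = {"A": "archaea", "B": "bacteria"}.get(id_match.group(1))
    return Trigger(database, identifier, kingdom)
```

For instance, parse_trigger("HAMAP; MF_01322") yields a Trigger whose kingdom is None, because the rule has a single profile.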
Propagated annotation
Identifier, protein and gene names
This section contains:
- An identifier: the mnemonic code for the protein name used in the UniProtKB entry name
- The recommended protein name. It can also contain alternative names.
- The common gene name of the protein family, when it exists
Comments
This section contains all applicable comment lines of a UniProtKB entry (see: the General annotation (Comments) section of the UniProtKB User Manual).
Keywords
This section contains all applicable keywords of a UniProtKB entry (see: the Keywords section of the UniProtKB User Manual).
Gene Ontology
This section contains all applicable GO terms and the corresponding cross-references to the Gene Ontology database.
Cross-references
This section indicates cross-references to domain databases within a UniProtKB entry; currently PROSITE, Pfam, PRINTS, TIGRFAMs and PIRSF (see: the Cross-references section of the UniProtKB User Manual).
The format is:
Database Name identifier1; identifier2; number of expected hits;
e.g.
Pfam PF02033; RBFA; 1;
TIGRFAMs TIGR00082; rbfA; 1;
PROSITE PS01319; RBFA; 1;
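As an illustration of the cross-reference format, the sketch below splits a run of entries into structured records. The names CrossReference and parse_cross_references are hypothetical, and the sketch assumes that every entry has exactly three semicolon-terminated fields and that the expected hit count is a single integer.

```python
from typing import List, NamedTuple


class CrossReference(NamedTuple):
    """Hypothetical record for one cross-reference entry."""
    database: str
    identifier1: str
    identifier2: str
    expected_hits: int


def parse_cross_references(text: str) -> List[CrossReference]:
    """Parse one or more 'Database id1; id2; hits;' entries."""
    # Split on ';' and drop the empty piece after the final terminator.
    fields = [f.strip() for f in text.split(";") if f.strip()]
    if len(fields) % 3 != 0:
        raise ValueError("expected three fields per cross-reference")
    refs = []
    for i in range(0, len(fields), 3):
        # The first field holds the database name and the first identifier.
        database, _, identifier1 = fields[i].partition(" ")
        if not identifier1:
            raise ValueError(f"missing identifier in {fields[i]!r}")
        refs.append(
            CrossReference(database, identifier1.strip(),
                           fields[i + 1], int(fields[i + 2]))
        )
    return refs
```

Because the parser splits on semicolons rather than on line breaks, it accepts entries written one per line as well as several entries on a single line.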
Computed features
This section indicates which other rule(s) must be applied to completely annotate the protein.
Two main cases can be distinguished:
-
Triggering of Domain and/or Site rules:
This concerns rules that annotate domain(s) and/or site(s).
The format is:
PROSITE identifier1; identifier2; number of expected hits; trigger=accession number of the rule to be triggered
(e.g. PROSITE PS50035; PLD; 1; trigger=PRU00153;)
Clicking on the accession number of the rule displays the corresponding annotation rule that is triggered.
-
Triggering of other rule(s) to annotate features such as transmembrane regions or coiled coils:
In this case the format is:
General feature name; -; number of expected hits; trigger=yes
(e.g. General Transmembrane; -; 6-10; trigger=yes;)
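Both kinds of computed-feature line share the same four-field layout, so a single reader can cover them. The sketch below is illustrative only: ComputedFeature and parse_computed_feature are hypothetical names, and the sketch assumes that the expected hit count is either one integer or a "min-max" range, and that "-" marks an absent second identifier.

```python
from typing import NamedTuple, Optional


class ComputedFeature(NamedTuple):
    """Hypothetical record for one computed-feature line."""
    source: str             # "PROSITE" or "General"
    identifier: str         # e.g. "PS50035" or "Transmembrane"
    label: Optional[str]    # second identifier, None when given as "-"
    min_hits: int
    max_hits: int
    trigger: str            # rule accession (e.g. "PRU00153") or "yes"


def parse_computed_feature(line: str) -> ComputedFeature:
    """Parse 'Source id1; id2; hits; trigger=value;'."""
    fields = [f.strip() for f in line.split(";") if f.strip()]
    if len(fields) != 4:
        raise ValueError(f"expected four fields in {line!r}")
    source, _, identifier = fields[0].partition(" ")
    label = None if fields[1] == "-" else fields[1]
    # Assumption: hits are written as "1" or as a range such as "6-10".
    low, sep, high = fields[2].partition("-")
    min_hits = int(low)
    max_hits = int(high) if sep else min_hits
    key, _, value = fields[3].partition("=")
    if key != "trigger" or not value:
        raise ValueError(f"missing trigger in {line!r}")
    return ComputedFeature(source, identifier.strip(), label,
                           min_hits, max_hits, value)
```

A caller could then treat any trigger value other than "yes" as the accession number of a Domain or Site rule to apply next.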
Features
This section contains:
-
Template feature line(s), which define the template for all the subsequent Feature lines.
The format is:
From: template name
where template name is the identifier (ID and AC) of a sequence in the seed alignment.
(e.g. From: ACP_ECOLI (P02901))
-
Applicable feature lines that may be applied to UniProtKB entries (e.g. ACT_SITE, METAL; see the Sequence annotation (Features) section of the UniProtKB User Manual).
e.g.
Key       From  To   Description  Condition
DISULFID  60    80                C-x*-C
An "Optional" label can be used to indicate a feature that is not mandatory in the matched sequences.
e.g.
Key                 From  To   Description  Condition
BINDING (Optional)  153   153  ATP          [RQ]
Multiple FT lines that should be applied either all together or not at all are grouped within an "FTGroup", to force the common presence of all sites.
e.g.
Key       From  To   Description          Condition  FTGroup
ACT_SITE  42    42   Charge relay system  H          1
ACT_SITE  91    91   Charge relay system  D          1
ACT_SITE  186   186  Charge relay system  S          1
This group can then be referenced by case statements in any other annotation section to be propagated.
For instance:
case <FTGroup:1>
COFACTOR: Name=Zn(2+); Xref=ChEBI:CHEBI:29105; Note=Binds 1 zinc ion per subunit.;
end case
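The all-or-nothing behaviour of an FTGroup can be sketched as follows. This is an assumption-laden illustration rather than the HAMAP implementation: FeatureLine, condition_holds and applicable_features are hypothetical names, positions are taken to be already mapped from the template onto the target sequence, and only single-residue conditions such as "H" or "[RQ]" are handled (range patterns such as "C-x*-C" are out of scope).

```python
import re
from typing import List, NamedTuple, Optional


class FeatureLine(NamedTuple):
    """Hypothetical record for one feature line of a rule."""
    key: str
    start: int                   # 1-based position on the target sequence
    end: int
    description: str
    condition: str               # single-residue pattern, e.g. "H" or "[RQ]"
    group: Optional[int] = None  # FTGroup number, None when ungrouped


def condition_holds(feature: FeatureLine, sequence: str) -> bool:
    """Check that the residue(s) at the feature position satisfy the condition."""
    if feature.start < 1 or feature.end > len(sequence):
        return False
    residues = sequence[feature.start - 1:feature.end]
    return re.fullmatch(feature.condition, residues) is not None


def applicable_features(features: List[FeatureLine],
                        sequence: str) -> List[FeatureLine]:
    """Keep ungrouped features that match, and groups only if every member matches."""
    failed_groups = {
        f.group for f in features
        if f.group is not None and not condition_holds(f, sequence)
    }
    kept = []
    for f in features:
        if f.group is not None:
            # One failing member removes the whole group.
            if f.group not in failed_groups:
                kept.append(f)
        elif condition_holds(f, sequence):
            kept.append(f)
    return kept
```

With the three ACT_SITE lines of FTGroup 1, a target carrying H, D and S at positions 42, 91 and 186 keeps all three sites, whereas a target lacking any one of these residues keeps none of them.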
Additional information
-
Size range: The minimal and maximal sizes of proteins matching the rule are listed.
-
Related Rules: Lists identifiers of rules that are known to be similar in sequence, and which may produce cross-matches. These are particularly useful when two different rules exist for a short and a long version of the same protein. Long proteins will match both profiles; in that case the longer family supersedes the shorter family (e.g. MF_00344 supersedes MF_00345).
-
Fusion: Indicates if at least one rule member has been found fused to another protein/domain at its N- or C-terminus. Fusion may be to another protein or to a known/unknown domain.
- Comments on the rule: This optional section contains additional useful information including: 5-letter codes of organisms with possible wrong starts, divergent paralogs, proteins that are excluded from alignment due to anomalies, etc.
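The superseding relationship between related rules can be illustrated with a short sketch. The function name resolve_matches and the shape of the supersedes mapping are hypothetical; the sketch only assumes that, when a protein matches both a longer and a shorter family, the match to the shorter family is discarded.

```python
from typing import Dict, List, Set


def resolve_matches(matched: List[str],
                    supersedes: Dict[str, Set[str]]) -> List[str]:
    """Drop every matched rule that is superseded by another matched rule.

    `supersedes` maps a rule accession to the accessions it supersedes,
    e.g. {"MF_00344": {"MF_00345"}}.
    """
    matched_set = set(matched)
    dropped = set()
    for longer, shorter_rules in supersedes.items():
        # A rule only supersedes others when it matched the protein itself.
        if longer in matched_set:
            dropped.update(shorter_rules)
    # Preserve the original order of the remaining matches.
    return [rule for rule in matched if rule not in dropped]
```

A protein matching both MF_00344 and MF_00345 would be left with MF_00344 only, while a protein matching MF_00345 alone keeps that match.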