HAMAP annotation rules: User Manual for the Web View

The HAMAP annotation rules are written in the UniRule format, which is used by the UniProt Knowledgebase (UniProtKB) automated annotation projects to annotate protein records in the UniProtKB format. The rules can be displayed in a user-friendly Web View which consists of the following three main sections and associated sub-sections.





General rule information

Accession
This section indicates the accession number of the rule, in the form MF_xxxxx.

Dates
This section is composed of two lines. The first line indicates the rule creation date; the second corresponds to the last rule revision date.

Name
This section provides the name of the protein.

Scope
This section indicates the taxonomic range covered by the rule.

Triggered by
This section indicates the profile identifier(s) used to trigger the application of the rule. The trigger can be either:
  • A HAMAP profile derived from the seed alignment of representative members of the family.
    In this case the format is:
    HAMAP; the profile identifier (e.g. HAMAP; MF_01322)
    Clicking on the profile identifier displays the HAMAP profile page, which shows general information about the profile and statistics for profile matches in UniProtKB.
    In most cases the HAMAP rule is triggered by a single profile. Sometimes, however, there are two profiles: one for bacteria (and plastids, if applicable) and another for archaea. In these cases, the profile identifier carries the suffix '_B' for bacteria (and plastids, if applicable) or '_A' for archaea (e.g. MF_00036).

  • A PROSITE profile
    In this case the format is:
    PROSITE; the profile identifier (e.g. PROSITE; PS51244)
    Clicking on the profile identifier displays the PROSITE profile page, which shows general information about the profile and statistics for profile matches in UniProtKB.
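
The two trigger formats above can be read programmatically. The following is a minimal sketch, not part of HAMAP itself: the function name parse_trigger and the identifier patterns (MF_ or PS followed by five digits, with an optional _A/_B suffix for HAMAP profiles) are assumptions inferred from the examples in this manual.

```python
import re
from typing import NamedTuple, Optional

# Assumed identifier shapes, inferred from the examples in this manual:
#   HAMAP profile   -> "MF_" + 5 digits, optionally followed by _A or _B
#   PROSITE profile -> "PS" + 5 digits
_PATTERNS = {
    "HAMAP": re.compile(r"MF_\d{5}(?:_(?P<suffix>[AB]))?"),
    "PROSITE": re.compile(r"PS\d{5}"),
}
_SCOPE = {"A": "archaea", "B": "bacteria (and plastids, if applicable)"}


class Trigger(NamedTuple):
    database: str
    identifier: str
    scope: Optional[str]  # only set for HAMAP profiles with an _A/_B suffix


def parse_trigger(text: str) -> Trigger:
    """Parse a 'Triggered by' value such as 'HAMAP; MF_01322'."""
    database, _, identifier = (part.strip() for part in text.partition(";"))
    pattern = _PATTERNS.get(database)
    if pattern is None:
        raise ValueError(f"unknown trigger database: {database!r}")
    match = pattern.fullmatch(identifier)
    if match is None:
        raise ValueError(f"unexpected {database} identifier: {identifier!r}")
    suffix = match.groupdict().get("suffix")
    return Trigger(database, identifier, _SCOPE.get(suffix))


print(parse_trigger("HAMAP; MF_01322"))
print(parse_trigger("PROSITE; PS51244"))
```

Any value that does not start with HAMAP or PROSITE raises a ValueError, which makes unexpected trigger types easy to spot.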

Propagated annotation

Identifier, protein and gene names
This section contains:
  • An identifier: the mnemonic code for the protein name used in the UniProtKB entry name
  • The recommended protein name. It can also contain alternative names.
  • The common gene name of the protein family, when it exists

Comments
This section contains all applicable comment lines of a UniProtKB entry (see: the General annotation (Comments) section of the UniProtKB User Manual).

Keywords
This section contains all applicable keywords of a UniProtKB entry (see: the Keywords section of the UniProtKB User Manual).

Gene Ontology
This section contains all applicable GO terms and the corresponding cross-references to the Gene Ontology database.

Cross-references
This section indicates cross-references to domain databases within a UniProtKB entry; currently PROSITE, Pfam, PRINTS, TIGRFAMs and PIRSF (see: the Cross-references section of the UniProtKB User Manual).
The format is:
Database Name identifier1; identifier2; number of expected hits;

        Pfam     PF02033; RBFA; 1;
        TIGRFAMs TIGR00082; rbfA; 1;
        PROSITE  PS01319; RBFA; 1;
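
As an illustration of the format above, a cross-reference line can be split into its four fields. This is a minimal sketch; the class and function names are hypothetical, and the number of expected hits is kept as text because this manual does not state whether it is always a single integer.

```python
from typing import NamedTuple


class CrossReference(NamedTuple):
    database: str
    identifier1: str
    identifier2: str
    expected_hits: str  # kept as text; may not always be a plain integer


def parse_cross_reference(line: str) -> CrossReference:
    """Parse a line such as 'Pfam     PF02033; RBFA; 1;'."""
    # The database name is the first whitespace-delimited token.
    database, rest = line.split(None, 1)
    # The remaining fields are separated (and terminated) by semicolons.
    fields = [field.strip() for field in rest.split(";") if field.strip()]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields after the database name: {line!r}")
    return CrossReference(database, *fields)


for example in (
    "Pfam     PF02033; RBFA; 1;",
    "TIGRFAMs TIGR00082; rbfA; 1;",
    "PROSITE  PS01319; RBFA; 1;",
):
    print(parse_cross_reference(example))
```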



Computed features
This section indicates which other rule(s) must be applied to completely annotate the protein.
Two main cases can be distinguished:
  • Triggering of Domain and/or Site rules:
    This concerns rules that annotate domain(s) and/or site(s).
    The format is:
    PROSITE identifier1; identifier2; number of expected hits; trigger=accession number of the rule to be triggered
    (e.g. PROSITE PS50035; PLD; 1; trigger=PRU00153;)
    Clicking on the accession number of the rule displays the corresponding annotation rule that is triggered.

  • Triggering of other rule(s) to annotate features such as transmembrane or coiled-coil regions:
    In this case the format is:
    General feature name; -; number of expected hits; trigger=yes
    (e.g. General Transmembrane; -; 6-10; trigger=yes;)
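
Both computed-feature formats above share the same layout, so a single routine can read them. The sketch below is illustrative only: the names are hypothetical, and reading "6-10" as an inclusive range of expected hits is an assumption based on the example.

```python
from typing import NamedTuple, Optional, Tuple


class ComputedFeature(NamedTuple):
    source: str                     # "PROSITE" or "General"
    identifier1: str                # e.g. "PS50035" or "Transmembrane"
    identifier2: Optional[str]      # None when the field is "-"
    expected_hits: Tuple[int, int]  # (minimum, maximum), assumed inclusive
    trigger: str                    # rule accession, or "yes"


def parse_hits(text: str) -> Tuple[int, int]:
    """Read '1' as (1, 1) and '6-10' as (6, 10)."""
    low, separator, high = text.partition("-")
    return (int(low), int(high)) if separator else (int(low), int(low))


def parse_computed_feature(line: str) -> ComputedFeature:
    source, rest = line.split(None, 1)
    fields = [field.strip() for field in rest.split(";") if field.strip()]
    if len(fields) != 4 or not fields[3].startswith("trigger="):
        raise ValueError(f"unexpected computed feature line: {line!r}")
    identifier1, identifier2, hits, trigger = fields
    return ComputedFeature(
        source,
        identifier1,
        None if identifier2 == "-" else identifier2,
        parse_hits(hits),
        trigger[len("trigger="):],
    )


print(parse_computed_feature("PROSITE PS50035; PLD; 1; trigger=PRU00153;"))
print(parse_computed_feature("General Transmembrane; -; 6-10; trigger=yes;"))
```
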

Features
This section contains:
  1. Template feature line(s), which define the template for all the subsequent feature lines.
    The format is:
    From: template name
    where template name is the identifier (ID and AC) of a sequence in the seed alignment.
    (e.g. From: ACP_ECOLI (P02901))
  2. Feature lines that may be applied to UniProtKB entries (e.g. ACT_SITE, METAL; see the FT section of the UniProtKB User Manual).
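
The template line described in item 1 can be split into its entry name (ID) and accession (AC). A minimal sketch follows; the function name is hypothetical and the pattern is inferred from the single example given here.

```python
import re
from typing import Tuple

# "From: <ID> (<AC>)", as in "From: ACP_ECOLI (P02901)"
_TEMPLATE_LINE = re.compile(r"From:\s*(?P<id>\w+)\s*\((?P<ac>\w+)\)")


def parse_template_line(line: str) -> Tuple[str, str]:
    """Return (entry name, accession) from a template feature line."""
    match = _TEMPLATE_LINE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a template line: {line!r}")
    return match["id"], match["ac"]


print(parse_template_line("From: ACP_ECOLI (P02901)"))
```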

Conditions may be used in feature lines. They usually correspond to pattern constraints, or to the presence of a specific amino acid.
e.g.

Key          From    To    Description      Condition
DISULFID       60    80    By similarity    C-x*-C

The "Optional" label can be used to mark a feature that is not mandatory in the matched sequences.
e.g.

Key                   From     To    Description            Condition
BINDING (Optional)     153    153    ATP (By similarity)    [RQ]
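
To make the Condition column concrete, the sketch below checks a condition against the residues spanned by a feature. It rests on several assumptions that this manual does not confirm: that '-' separates pattern elements, that 'x' stands for any residue and 'x*' for any number of residues, and that the From/To positions have already been mapped onto the matched sequence (1-based, inclusive). The function names are hypothetical.

```python
import re


def condition_to_regex(condition: str) -> re.Pattern:
    """Translate a Condition value into a regular expression.

    Assumed syntax: '-' separates elements, 'x' is any residue and 'x*'
    is any run of residues. Other elements (C, [RQ]) are used as they are.
    """
    translated = []
    for element in condition.split("-"):
        if element == "x":
            translated.append(".")
        elif element == "x*":
            translated.append(".*")
        else:
            translated.append(element)
    return re.compile("".join(translated))


def condition_holds(sequence: str, start: int, end: int, condition: str) -> bool:
    """Check the condition on residues start..end (1-based, inclusive)."""
    if start < 1 or end > len(sequence) or start > end:
        return False  # the feature does not fit on this sequence
    region = sequence[start - 1:end]
    return condition_to_regex(condition).fullmatch(region) is not None


toy_sequence = "MKCAAAACGR"  # invented sequence, for illustration only
print(condition_holds(toy_sequence, 3, 8, "C-x*-C"))   # cysteines at both ends
print(condition_holds(toy_sequence, 10, 10, "[RQ]"))   # R or Q at one position
```

With a check of this kind, a feature carrying the "Optional" label could simply be skipped when its condition fails, rather than being treated as a mismatch.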

Multiple FT lines that should be applied either all together or not at all are grouped within an "FTGroup", so that the sites are annotated only when all of them are present.
e.g.

Key         From    To     Description                            Condition    FTGroup
ACT_SITE      42    42     Charge relay system (By similarity)    H            1
ACT_SITE      91    91     Charge relay system (By similarity)    D            1
ACT_SITE     186   186     Charge relay system (By similarity)    S            1

The group can then be referenced by case statements in any other annotation section, so that the enclosed annotation is propagated only when the group applies.
For instance:

case <FTGroup:1>
  COFACTOR: Binds 1 zinc ion per subunit (By similarity).
end case
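
The all-or-nothing behaviour of an FTGroup, and the way a case statement depends on it, can be sketched as follows. This illustrates the logic described above and is not the HAMAP implementation: the names are hypothetical, conditions are reduced to a single required residue, and positions are assumed to be already mapped onto the matched sequence.

```python
from typing import Dict, List, NamedTuple, Optional, Set, Tuple


class Feature(NamedTuple):
    key: str
    position: int                # 1-based position in the matched sequence
    residue: str                 # residue required by the Condition column
    group: Optional[int] = None  # FTGroup number, if any


def condition_met(sequence: str, feature: Feature) -> bool:
    index = feature.position - 1
    return 0 <= index < len(sequence) and sequence[index] == feature.residue


def apply_features(
    sequence: str, features: List[Feature]
) -> Tuple[List[Feature], Set[int]]:
    """Return the features to annotate and the FTGroups that are satisfied."""
    groups: Dict[int, List[Feature]] = {}
    for feature in features:
        if feature.group is not None:
            groups.setdefault(feature.group, []).append(feature)
    # A group is satisfied only if every one of its members matches.
    satisfied = {
        group_id
        for group_id, members in groups.items()
        if all(condition_met(sequence, member) for member in members)
    }
    kept = []
    for feature in features:
        if feature.group is None:
            if condition_met(sequence, feature):
                kept.append(feature)
        elif feature.group in satisfied:
            kept.append(feature)
    return kept, satisfied


def propagate_case_comments(
    satisfied: Set[int], case_comments: Dict[int, str]
) -> List[str]:
    """Keep only the annotation whose 'case <FTGroup:n>' group is satisfied."""
    return [text for group_id, text in case_comments.items() if group_id in satisfied]


CHARGE_RELAY = [
    Feature("ACT_SITE", 42, "H", group=1),
    Feature("ACT_SITE", 91, "D", group=1),
    Feature("ACT_SITE", 186, "S", group=1),
]
```

If any one of the three residues is missing from the matched sequence, none of the ACT_SITE lines is kept and the annotation inside the case statement is not propagated.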


Additional information

  • Size range: The minimal and maximal sizes of proteins matching the rule are listed.

  • Related Rules: Lists identifiers of rules that are known to be similar in sequence, and which may produce cross-matches. These are particularly useful when two different rules exist for a short and long version of the same protein. Long proteins will match both profiles; under these circumstances the longer family supersedes the shorter family (e.g. MF_00344 supersedes MF_00345).

  • Template(s): Lists the accession numbers of the entries from which the rule's annotation was inferred. The template entries are usually characterized. "Template: None" indicates that there are no characterization papers on any of the proteins that belong to that family. This is the case for UPFs (Uncharacterized Protein Family), for example.

  • Fusion: Indicates whether at least one rule member has been found fused to another protein/domain at its N- or C-terminus. Fusion may be to another protein or to a known/unknown domain.

  • Comments on the rule: This optional section contains additional useful information including: 5-letter codes of organisms with possible wrong starts, divergent paralogs, proteins that are excluded from alignment due to anomalies, etc.
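
The "Related Rules" behaviour described above, where a long protein matches two profiles and the longer family supersedes the shorter one, can be sketched as a small filtering step. The function name and the shape of the supersedes mapping are hypothetical; how the annotation pipeline resolves such cross-matches internally is not described in this manual.

```python
from typing import Dict, Set


def resolve_superseded(matched: Set[str], supersedes: Dict[str, Set[str]]) -> Set[str]:
    """Drop every matched rule that is superseded by another matched rule."""
    superseded: Set[str] = set()
    for rule in matched:
        # Only rules that are themselves matched can supersede others.
        superseded |= supersedes.get(rule, set()) & matched
    return matched - superseded


# Relationship taken from the example above: MF_00344 supersedes MF_00345.
SUPERSEDES = {"MF_00344": {"MF_00345"}}

print(resolve_superseded({"MF_00344", "MF_00345"}, SUPERSEDES))
print(resolve_superseded({"MF_00345"}, SUPERSEDES))
```

A protein that matches only the shorter family keeps that match, since nothing supersedes it.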