HAMAP annotation rules: User Manual for the Web View
The HAMAP annotation rules are written in the UniRule format, which is used by
the UniProt Knowledgebase (UniProtKB) automated annotation projects to annotate
protein records in the UniProtKB format. The rules can be displayed in a
user-friendly Web View, which consists of the following three main sections and
their associated sub-sections.
General rule information
Accession
This section indicates the accession number of the rule, in the form MF_xxxxx.
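As an illustration, the accession form can be checked with a short Python
sketch. It assumes that "xxxxx" stands for exactly five digits, which matches
the examples in this manual (e.g. MF_01322) but is not stated explicitly; the
function name is hypothetical.

```python
import re

# Assumption: "xxxxx" means exactly five digits, as in MF_01322.
HAMAP_ACCESSION = re.compile(r"MF_\d{5}")


def is_hamap_accession(text: str) -> bool:
    """Return True if text has the form MF_ followed by five digits."""
    return HAMAP_ACCESSION.fullmatch(text) is not None
```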
Dates
This section is composed of two lines. The first line indicates the rule
creation date; the second corresponds to the last rule revision date.
Name
This section provides the name of the protein.
Scope
This section indicates the taxonomic range covered by the rule.
Template(s)
This section lists the UniProtKB accession numbers of the entries from which the rule's annotation was
inferred. The template entries are usually characterized. "Template: None"
indicates that there are no characterization papers on any of the proteins
that belong to that family. This is the case for UPFs (Uncharacterized
Protein Family), for example.
Triggered by
This section indicates the profile identifier(s) used to trigger the
application of the rule. The trigger can be either:
- A HAMAP profile derived
from the seed alignment of representative members of the family.
In this case the format is:
HAMAP; the profile identifier (e.g. HAMAP; MF_01322)
Clicking on the profile identifier displays the HAMAP profile page,
which shows general information about the profile and statistics for profile
matches in UniProtKB.
In most cases the HAMAP rule is triggered by a single profile. Sometimes,
however, there are two profiles: one for bacteria (and plastids, if
applicable) and another for archaea. In these cases, the profile identifier
suffix is '_B' for bacteria (and plastids, if applicable) and '_A' for
archaea (e.g. MF_00036).
- A PROSITE profile
In this case the format is:
PROSITE; the profile identifier (e.g. PROSITE; PS51244)
Clicking on the profile identifier displays the PROSITE profile
page, which shows general information about the profile and statistics for
profile matches in UniProtKB (e.g. MF_01708).
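The two trigger formats described above can be read with a few lines of
Python. This is a minimal sketch, not the parser used by HAMAP: the names
Trigger and parse_trigger are hypothetical, and the identifier MF_00036_B used
in the usage note is an assumed illustration of the '_B' suffix rather than a
value quoted from a rule.

```python
from typing import NamedTuple, Optional


class Trigger(NamedTuple):
    database: str          # "HAMAP" or "PROSITE"
    identifier: str        # e.g. "MF_01322" or "PS51244"
    domain: Optional[str]  # "bacteria", "archaea", or None


# Suffixes described in the text for rules with two HAMAP profiles.
SUFFIXES = {"_B": "bacteria", "_A": "archaea"}


def parse_trigger(line: str) -> Trigger:
    """Split 'DATABASE; identifier' and note any _A/_B profile suffix."""
    database, identifier = (part.strip() for part in line.split(";", 1))
    if database not in ("HAMAP", "PROSITE"):
        raise ValueError(f"unexpected trigger database: {database!r}")
    domain = None
    if database == "HAMAP":
        # Only HAMAP profile identifiers carry the _A/_B suffix.
        domain = SUFFIXES.get(identifier[-2:])
    return Trigger(database, identifier, domain)
```

For example, parse_trigger("HAMAP; MF_00036_B") would report the domain
"bacteria", while parse_trigger("PROSITE; PS51244") leaves the domain unset.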
Propagated annotation
Identifier, protein and gene names
This section contains:
- An identifier: the mnemonic code for the protein name used in the
UniProtKB entry name
- The recommended protein name. It can also contain alternative names.
- The common gene name of the protein family, when it exists
Comments
This section contains all applicable comment lines of a UniProtKB entry
(see:
the General annotation (Comments) section of the UniProtKB User
Manual).
Keywords
This section contains all applicable keywords of a UniProtKB entry (see:
the Keywords
section of the UniProtKB User Manual).
Gene Ontology
This section contains all applicable GO terms and the corresponding
cross-references to the
Gene Ontology database.
Cross-references
This section indicates cross-references to domain databases within a
UniProtKB entry; currently PROSITE, Pfam, PRINTS, TIGRFAMs and PIRSF (see:
the Cross-references section of the UniProtKB User Manual).
The format is:
Database Name identifier1; identifier2; number of expected hits;
e.g.
Pfam PF02033; RBFA; 1;
TIGRFAMs TIGR00082; rbfA; 1;
PROSITE PS01319; RBFA; 1;
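A cross-reference line in this format can be split into its fields as
sketched below. The class and function names are hypothetical, and the sketch
assumes that the database name contains no spaces and that every field ends
with a semicolon, as in the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrossReference:
    database: str
    identifier1: str
    identifier2: str
    expected_hits: str  # kept as text; no numeric interpretation is assumed


def parse_cross_reference(line: str) -> CrossReference:
    """Parse 'Database identifier1; identifier2; hits;' into its fields."""
    database, rest = line.strip().split(None, 1)
    fields = [field.strip() for field in rest.split(";") if field.strip()]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields after the database name: {line!r}")
    return CrossReference(database, *fields)
```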
Computed features
This section indicates which other rule(s) must be applied to completely
annotate the protein.
Two main cases can be distinguished:
- Triggering of Domain and/or Site rules:
This concerns rules that annotate domain(s) and/or site(s).
The format is:
PROSITE identifier1; identifier2; number of expected hits; trigger=accession number of the rule to be triggered
(e.g. PROSITE PS50035; PLD; 1; trigger=PRU00153;)
Clicking on the accession number of the rule displays the
corresponding annotation rule that is triggered.
- Triggering of other rule(s) to annotate features such as Transmembrane or
coiled coil:
In this case the format is:
General feature name; -; number of expected hits; trigger=yes
(e.g. General Transmembrane; -; 6-10; trigger=yes;)
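Both computed-feature formats share the same layout, so one small parser can
cover them. The sketch below is illustrative only: the names are
hypothetical, and it assumes that the number of expected hits is either a
single integer ("1") or a range written as "min-max" ("6-10"), as in the two
examples above.

```python
from typing import NamedTuple, Optional, Tuple


class ComputedFeature(NamedTuple):
    source: str                    # "PROSITE" or "General"
    name: str                      # identifier1 or the feature name
    hits: Tuple[int, int]          # (minimum, maximum) expected hits
    triggered_rule: Optional[str]  # rule accession, or None for trigger=yes


def parse_hits(text: str) -> Tuple[int, int]:
    """Turn '1' into (1, 1) and '6-10' into (6, 10)."""
    low, _, high = text.partition("-")
    return int(low), int(high or low)


def parse_computed_feature(line: str) -> ComputedFeature:
    source, rest = line.strip().split(None, 1)
    fields = [field.strip() for field in rest.split(";") if field.strip()]
    if len(fields) != 4 or not fields[3].startswith("trigger="):
        raise ValueError(f"unrecognised computed feature line: {line!r}")
    name, _second, hits, trigger = fields
    target = trigger[len("trigger="):]
    # "trigger=yes" names no specific rule, so no accession is recorded.
    rule = None if target == "yes" else target
    return ComputedFeature(source, name, parse_hits(hits), rule)
```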
Features
This section contains:
- Template feature line(s), which define the template for all the
subsequent Feature lines.
The format is:
From: template name
where template name is the identifier (ID and AC) of a sequence in
the seed alignment.
(e.g. From: ACP_ECOLI (P02901))
- Applicable feature lines that may be applied to UniProtKB entries
(e.g. ACT_SITE, METAL; see the Sequence annotation (Features)
section of the UniProtKB User Manual).
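The template feature line can be split into the entry name (ID) and the
accession number (AC) as sketched here. The function name is hypothetical, and
the regular expression only assumes the "From: ID (AC)" layout shown in the
example above.

```python
import re

# "From: ACP_ECOLI (P02901)" -> entry name (ID) and accession number (AC).
TEMPLATE_LINE = re.compile(r"From:\s*(?P<entry_name>\w+)\s*\((?P<accession>\w+)\)")


def parse_template_line(line: str) -> tuple:
    """Return (entry_name, accession) from a template feature line."""
    match = TEMPLATE_LINE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a template feature line: {line!r}")
    return match["entry_name"], match["accession"]
```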
Conditions may be used in feature lines. They usually correspond to
pattern constraints, or to the presence of a specific amino acid.
e.g.
Key       From  To  Description  Condition
DISULFID  60    80               C-x*-C
An "Optional" label can be used to indicate the presence of a feature which is
not mandatory in the matched sequences.
e.g.
Key                 From  To   Description  Condition
BINDING (Optional)  153   153  ATP          [RQ]
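To make the idea of a condition concrete, the sketch below tests a condition
against the residues between From and To. It rests on several assumptions
that this manual does not confirm: that conditions follow a PROSITE-like
syntax in which elements are separated by "-", "x" means any residue and "*"
means "zero or more"; and that positions can be applied directly to the
target sequence, whereas a real implementation would first map template
positions onto the matched sequence. All names are hypothetical.

```python
import re


def condition_to_regex(condition: str) -> str:
    """Translate e.g. 'C-x*-C' to 'C.*C' (assumed PROSITE-like syntax)."""
    # Amino acid codes are upper case, so a lower-case "x" is the wildcard.
    return "".join(element.replace("x", ".") for element in condition.split("-"))


def condition_holds(sequence: str, start: int, end: int, condition: str) -> bool:
    """Check the condition on residues start..end (1-based, inclusive)."""
    segment = sequence[start - 1:end]
    return re.fullmatch(condition_to_regex(condition), segment) is not None
```

With the made-up sequence "MKCAAACR", condition_holds("MKCAAACR", 3, 7,
"C-x*-C") is satisfied because residues 3 and 7 are both cysteines, and
condition_holds("MKCAAACR", 8, 8, "[RQ]") is satisfied because residue 8 is an
arginine.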
Multiple feature (FT) lines that must be applied either all together or not at
all are grouped within an "FTGroup", which forces all the sites to be present
together.
e.g.
Key       From  To   Description          Condition  FTGroup
ACT_SITE  42    42   Charge relay system  H          1
ACT_SITE  91    91   Charge relay system  D          1
ACT_SITE  186   186  Charge relay system  S          1
This group can then be referenced by case statements in any other annotation
section to be propagated.
For instance:
case <FTGroup:1>
COFACTOR:
Name=Zn(2+); Xref=ChEBI:CHEBI:29105;
Note=Binds 1 zinc ion per subunit.;
end case
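The all-or-nothing behaviour of an FTGroup, and its use in a case statement,
can be sketched as follows. This is an illustration of the logic described
above, not the actual rule engine: the function names and the representation
of features as (group, condition met) pairs are assumptions.

```python
from collections import defaultdict


def applicable_groups(features):
    """Return the ids of groups in which every feature's condition is met.

    features is an iterable of (group_id, condition_met) pairs.
    """
    status = defaultdict(lambda: True)
    for group_id, condition_met in features:
        # One failing feature is enough to disable the whole group.
        status[group_id] = status[group_id] and condition_met
    return {group_id for group_id, ok in status.items() if ok}


def propagate(case_annotations, groups):
    """Keep the annotations whose 'case <FTGroup:n>' group is applicable.

    case_annotations is an iterable of (group_id, annotation_text) pairs.
    """
    return [text for group_id, text in case_annotations if group_id in groups]
```

If all three ACT_SITE conditions of group 1 are met, an annotation guarded by
"case <FTGroup:1>" is propagated; if any one of them fails, none of the
group's features and none of the dependent annotations are applied.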
Additional information
- Size range: Lists the minimal and maximal sizes of proteins matching the
rule.
- Related Rules: Lists identifiers of rules that are known to be similar in
sequence and may produce cross-matches. These are particularly useful when two
different rules exist for a short and a long version of the same protein. Long
proteins will match both profiles; under these circumstances the longer family
supersedes the shorter family (e.g. MF_00344 supersedes MF_00345).
- Fusion: Indicates whether at least one rule member has been found fused to
another protein or to a known or unknown domain at its N- or C-terminus.
- Comments on the rule: This optional section contains additional useful
information, including 5-letter codes of organisms with possible wrong starts,
divergent paralogs, proteins that are excluded from the alignment due to
anomalies, etc.
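The size range and the "supersedes" relation between related rules lend
themselves to a short sketch. The function names and the dictionary used to
record which rule supersedes which are hypothetical; only the MF_00344 and
MF_00345 example comes from this manual.

```python
def in_size_range(length: int, minimum: int, maximum: int) -> bool:
    """Return True if a protein length lies within the rule's size range."""
    return minimum <= length <= maximum


def resolve_matches(matched_rules, supersedes):
    """Drop every matched rule that is superseded by another matched rule.

    supersedes maps a rule accession to the set of accessions it supersedes.
    """
    matched = set(matched_rules)
    dropped = set()
    for rule in matched:
        # A rule is only dropped if its superseding rule also matched.
        dropped |= supersedes.get(rule, set()) & matched
    return matched - dropped
```

For a long protein that matches both profiles, resolve_matches({"MF_00344",
"MF_00345"}, {"MF_00344": {"MF_00345"}}) keeps only MF_00344, whereas a short
protein that matches MF_00345 alone keeps that rule.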