BBF RFC-13: Rethinking the boundaries and composition of coding regions
19 November 2008
Related RFCs: 9, 11, 12
Keywords: protein fusions, domains, protein tags, assembly
With the advent of several assembly standards fostering in-frame
protein domain fusions, it is important to rethink our categorization
of parts to allow the documentation and distribution of parts
containing only a portion of a protein coding region. This RFC
attempts to document initial thoughts on the naming and documentation
of such sub-coding region parts.
Proteins typically consist of one or more domains, sequences of amino
acids which fold relatively independently and which are evolutionarily
shuffled as a unit among different protein coding regions. The DNA
sequence of such domains must maintain in-frame translation, and thus
is a multiple of three bases.
In our older assembly technology, the assembly scar was 8 bases long,
and failed to maintain the coding region frame. Several proposals for
new assembly techniques, including the Ira Phillips proposal, Bam/Bgl,
BB-2 (see RFCs 11, 12, 14), and blunt scarless assembly, allow in-frame
composition of protein domains.
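
The frame arithmetic behind this can be sketched in a few lines of
Python. The helper names preserves_frame and frame_shift are
hypothetical, and the 6-base scar length is an illustrative assumption
rather than a figure from this RFC; only the 8-base length comes from
the text above.

```python
CODON_LENGTH = 3  # bases per codon


def frame_shift(scar_length: int) -> int:
    """Number of bases by which a scar offsets the downstream reading frame."""
    return scar_length % CODON_LENGTH


def preserves_frame(scar_length: int) -> bool:
    """True if a scar of this length leaves the downstream frame intact."""
    return frame_shift(scar_length) == 0


# 8-base scar (from the text): 8 mod 3 = 2, so the frame is broken.
print(preserves_frame(8), frame_shift(8))
# Assumed 6-base scar and a scarless join (0 bases): both stay in frame.
print(preserves_frame(6), preserves_frame(0))
```

Any scar whose length is a multiple of three passes this check, though
the check says nothing about which amino acids the scar encodes.
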
The N-terminal domain of a protein coding region is special in a
number of ways. First, it always contains a start codon, spaced at an
appropriate distance from a ribosomal binding site. Second, many
coding regions have special features at the N terminus, such as
protein export tags and lipoprotein cleavage and attachment tags.
These cannot function when internal to a coding region, and are
therefore termed Head domains.
Similarly, the C-terminal domain of a protein is special, containing
at least a stop codon. Other special features, such as degradation
tags, are also required to be at the extreme C-terminus. Again, these
domains cannot function when internal to a coding region, and are
termed Tail domains.
Each coding region will consist logically of at least three domains: a
Head domain, one or more Internal domains, and a Tail domain. A part
in the registry may (similar to any composite part) consist of a
composition of domains. In particular, existing coding regions
consist of a particularly simple Head domain (the start codon), a
single internal domain, and a simple Tail domain (the stop codon).
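
As a rough sketch, the logical composition rule can be written as a
check over an ordered list of domain kinds. The function name
is_valid_coding_region and the string labels "head", "internal", and
"tail" are hypothetical names chosen for this example, not registry
terminology.

```python
from typing import Sequence


def is_valid_coding_region(kinds: Sequence[str]) -> bool:
    """Check the pattern: one head, one or more internal domains, one tail."""
    if len(kinds) < 3:
        return False
    if kinds[0] != "head" or kinds[-1] != "tail":
        return False
    # Everything between the head and the tail must be an internal domain.
    return all(kind == "internal" for kind in kinds[1:-1])


# An existing simple coding region: start codon, one domain, stop codon.
print(is_valid_coding_region(["head", "internal", "tail"]))
# A fusion of two internal domains is also valid.
print(is_valid_coding_region(["head", "internal", "internal", "tail"]))
# A head joined directly to a tail has no internal domain.
print(is_valid_coding_region(["head", "tail"]))
```
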
(1) Head Domain: The Head Domain consists of the start codon followed
immediately by zero or more triplets specifying an N-terminal
tag, such as a protein export tag or lipoprotein binding tag.
(2) Internal Domains: Internal domains consist of a series of codon triplets
coding for an amino acid sequence without a start codon or stop
codon. Multiple Internal Domains can be fused.
(3) Special Internal Domains: Short Internal Domains with specific function may be
separately categorized, but obey the same composition rules as
normal Internal domains. Special Internal Domains include tags, linkers,
(4) Tail Domain: The Tail Domain consists of zero or more
triplet codons, followed by a pair of TAA stop codons. In the
simplest case, the Tail Domain is just the pair of stop codons.
More complex Tail Domains may include degradation tags
appropriate to the organism (with different degradation rates,
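
The four definitions above can be approximated by a small classifier
over DNA sequences. This is a minimal sketch under stated assumptions:
the function names are hypothetical, ATG is assumed to be the start
codon, "without a start codon" is read as "does not begin with ATG"
(internal methionine codons are allowed), and any of TAA, TAG, or TGA
inside the body is treated as an error.

```python
START_CODON = "ATG"                # assumed start codon
STOP_CODONS = {"TAA", "TAG", "TGA"}
TAIL_STOP = ["TAA", "TAA"]         # pair of TAA stop codons, per the text


def codons(seq: str) -> list:
    """Split a sequence into triplets, rejecting lengths not divisible by 3."""
    seq = seq.upper()
    if not seq or len(seq) % 3 != 0:
        raise ValueError("sequence must be a non-empty multiple of three bases")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def classify_domain(seq: str) -> str:
    """Return 'head', 'internal', 'tail', or 'complete' for a sequence."""
    triplets = codons(seq)
    starts = triplets[0] == START_CODON
    ends = triplets[-2:] == TAIL_STOP
    # Strip the start codon and the stop pair, then check what remains.
    body = triplets[1:] if starts else triplets
    body = body[:-2] if ends else body
    if any(c in STOP_CODONS for c in body):
        raise ValueError("in-frame stop codon inside the domain body")
    if starts and ends:
        return "complete"
    if starts:
        return "head"
    if ends:
        return "tail"
    return "internal"


print(classify_domain("ATG"))           # simplest head
print(classify_domain("GGTAAAGAA"))     # internal
print(classify_domain("TAATAA"))        # simplest tail
```

Special Internal Domains are not distinguished here, since they obey
the same composition rules as other Internal Domains. Note that a Tail
Domain whose tag happens to begin with ATG would be reported as
"complete" by this sketch.
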
Note that different assembly techniques will, in general, result in
different amino acid sequences for coding regions composed out of the
same Head, Tail, and Internal Domains. We anticipate that users will use
care in thinking about the effects of such differences on their
experiments, but also feel confident that many such differences will
be minor, when the composition uses structures such as export tags,
degradation tails, and purification tags.
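
To make this concrete, the sketch below joins the same Head, Internal,
and Tail Domains with different scars and translates the result using
the standard genetic code. The function names, the example domain
sequences, and the two 6-base scars are illustrative assumptions chosen
for this sketch, not sequences specified by this RFC.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def compose(domains, scar=""):
    """Join domains in order, leaving `scar` at every junction."""
    return scar.join(domains)


def translate(dna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


head, internal, tail = "ATG", "GGTAAAGAA", "TAATAA"  # example domains

for scar in ("", "GGATCT", "ACTAGA"):  # scarless and two assumed scars
    coding_region = compose([head, internal, tail], scar)
    print(repr(scar), translate(coding_region))
```

With these inputs the scarless join yields MGKE, while the two scars
yield MGSGKEGS and MTRGKETR: the same domains, but with two extra
residues at every junction that depend on the assembly technique.
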
RFC 14 describes the use of these concepts in combination with BB-2