The Genetic Code and Symbol Position Budgets

Applying the bit budget principles to the genetic code

The four symbols of the genetic code are:

 

Code Letters and Nucleotides Represented

             DNA nucleotide bases    mRNA nucleotides
purines      A   adenine             A   adenosine
             G   guanine             G   guanosine
pyrimidines  T   thymine             U   uridine
             C   cytosine            C   cytidine

Question: how could Watson and Crick (in 1953) anticipate a requirement for triplet (three-position) code units (or codons) for DNA?

Information at hand:

  • Certain proteins must be synthesized within cells for mammalian life. The "building blocks" of proteins are amino acids. Proteins are composed of sequences of amino acids.
  • There are twenty (20) amino acids required to build proteins (on ribosomes) in mammalian cells. Some are synthesized in the cells; others cannot be synthesized there and must come from the diet.
  • DNA (and mRNA) specify sequences of amino acids needed to synthesize the required proteins. (mRNA = messenger RNA). mRNA is read sequentially in the translation and synthesis process using a pattern matching process to match a codon to its complementary anticodon that is associated with the needed amino acid. Each code letter has its complement for transcription or translation.

Base      Complement
A         U
G         C
T or U    A
C         G

Thus the complement of the mRNA codon AUG would be the anticodon UAC.
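The complement table can be sketched as a simple lookup. The names COMPLEMENT and complement are illustrative assumptions, not part of the original page, and the sketch pairs letters position by position (as the AUG example does) without modeling strand direction.

```python
# Complement of each code letter when the result is written as RNA.
# T (DNA) and U (RNA) both pair with A. Names here are illustrative.
COMPLEMENT = {"A": "U", "G": "C", "T": "A", "U": "A", "C": "G"}

def complement(sequence: str) -> str:
    """Return the position-by-position RNA complement of a sequence."""
    return "".join(COMPLEMENT[letter] for letter in sequence.upper())

print(complement("AUG"))  # UAC
```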

  • If each mRNA codon identifies one amino acid, what is the minimum number of code letter positions required to uniquely identify at least 20 different amino acids?
  • Note: codons can represent more than 20 different entities (in this case, amino acids).

The code must be able to represent 20 different entities.

If we have one symbol position, we can represent (in a 4-value system) four different single values. If we have two positions, we can represent 4 × 4 = 16 unique combinations of two values (not quite enough, and nothing left over for control characters). So we need at least three symbol positions, giving us 4 × 4 × 4 = 64 unique combinations of three values.
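The same budget arithmetic can be checked with a short sketch. The function name min_positions is an assumption made for illustration; it simply searches for the smallest number of positions whose combination count reaches the number of entities.

```python
def min_positions(symbols: int, entities: int) -> int:
    """Smallest n such that symbols**n >= entities."""
    positions = 1
    while symbols ** positions < entities:
        positions += 1
    return positions

# Combinations available with 1, 2, and 3 positions of a 4-symbol code.
for n in (1, 2, 3):
    print(n, 4 ** n)  # prints "1 4", then "2 16", then "3 64"

print(min_positions(4, 20))  # 3 positions of a 4-symbol code
print(min_positions(2, 20))  # 5 positions (bits) of a 2-symbol code
```

The second call shows the matching bit budget: representing 20 entities in binary takes 5 bits, since 2^4 = 16 is too few and 2^5 = 32 is enough.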

Once the code was worked out, it turned out that one amino acid can be represented by 1, 2, 3, 4, or 6 codons. This is called a degenerate code, since several different codons can stand for the same amino acid rather than each amino acid having a single unique codon. Examples:

  • UGG (only) represents Trp = tryptophan
  • AAU and AAC represent Asn = asparagine
  • GGU, GGC, GGA, and GGG all represent Gly = glycine
  • UCU, UCC, UCA, UCG, AGU, and AGC all represent Ser = serine
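These examples can be reproduced from a table of the standard genetic code. The sketch below is a minimal illustration, not the page author's method: it builds the table from a compact 64-letter string of one-letter amino acid codes listed in UCAG order ("*" marks a stop codon), and the names BASES, AMINO_ACIDS, CODON_TABLE, and codons_for are assumptions chosen for this example.

```python
from collections import defaultdict
from itertools import product

BASES = "UCAG"
# Standard genetic code: one amino acid letter per codon, codons in UCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons (UUU, UUC, UUA, ...) to its amino acid letter.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: amino acid letter -> list of codons that represent it.
codons_for = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    codons_for[amino_acid].append(codon)

for letter, name in [("W", "Trp"), ("N", "Asn"), ("G", "Gly"), ("S", "Ser")]:
    print(name, len(codons_for[letter]), codons_for[letter])

# Number of distinct amino acids in the table, excluding the stop marker.
print(len([aa for aa in codons_for if aa != "*"]))  # 20
```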

Three codons turned out to be "nonsense," "terminal," or "stop" codons with no amino acid translation: UAA, UAG, and UGA (→ compare the role of control characters in computer codes and stop bits and flags in data communications). Start sequences have also been identified involving codons which represent amino acids (e.g., Met = methionine in multicellular animals) (→ compare headers and start bits and flags in data communications).
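The start and stop behavior can be sketched as a small reader that frames a message much like start and stop flags do in data communications. This is a simplified illustration under stated assumptions: translation is taken to begin at the first AUG found and to end at the first in-frame stop codon, which leaves out the details of real initiation. The function translate and the sample sequence are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

STOP_CODONS = [codon for codon, aa in CODON_TABLE.items() if aa == "*"]
print(STOP_CODONS)  # ['UAA', 'UAG', 'UGA']

def translate(mrna: str) -> str:
    """Read codons from the first AUG up to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: end of message
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical message: leader GG, then AUG UGG AAC GGU UAA, then trailer UCU.
print(translate("GGAUGUGGAACGGUUAAUCU"))  # MWNG
```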

DNA is even divided into sections of code that are used to code for protein segments (exons) and sections (introns*), originally presumed to be "nonsense," that are not used to code for protein segments. The cell can distinguish the intron sections and remove them as part of the process of deriving mRNA (this happens when the transcribed RNA is processed, not during DNA replication), splicing the exons together (→ compare HTML documents with embedded SQL: the embedded SQL is "nonsense" as far as HTML interpretation is concerned. This analogy is limited, as the significance of the introns in DNA remains a topic of conjecture).
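The splicing step can be pictured as joining selected segments of a longer string. The sketch below assumes the exon boundaries are already known and supplied as index pairs; how the cell actually recognizes those boundaries is not modeled. The function splice and the sample sequence are hypothetical.

```python
def splice(transcript: str, exons: list[tuple[int, int]]) -> str:
    """Join exon segments given as (start, end) index pairs, end exclusive."""
    return "".join(transcript[start:end] for start, end in exons)

# Hypothetical transcript: exons in upper case, intron in lower case
# purely for readability.
transcript = "AUGUGG" + "guaaguccag" + "AACUAA"

print(splice(transcript, [(0, 6), (16, 22)]))  # AUGUGGAACUAA
```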

Recommended reading on this topic:

  • Maxim D. Frank-Kametskii, Unraveling DNA: The Most Important Molecule of Life, rev. ed. (trans. by Lev Liapin), Perseus Books, 1997.
  • William D. Stansfield, Jaime S. Colomé, Raúl J. Cano, Molecular and Cell Biology, Schaum's Outlines, 1996.
  • The Human Genome Project: http://www.ornl.gov/TechResources/Human_Genome/home.html

  • For the current understanding of introns, see: http://www.panspermia.org/introns.htm and http://post.queensu.ca/~forsdyke/introns.htm and http://www.ndsu.edu/pubweb/~mcclean/plsc731/transcript/transcript4.htm.

  • When this page was first set up in 2003, this was current information: see W. Wayt Gibbs, “The Unseen Genome: Gems among the Junk,” Scientific American 289, 5 (November 2003): 46-53. See also an NIH site on non-coding RNA genes. For the concept of epigenetics, please see W. Wayt Gibbs, “The Unseen Genome: Beyond DNA,” Scientific American 289, 6 (December 2003): 106-113.

See also: i3450DNAComputing.htm

Thanks to Dr. Ordetta Mendoza, Head of the Department of Bioinformatics at Stella Maris College, Chennai, India, for reviewing this page (2006).

Valerie J. H. Powell, R.T.(R), Ph.D., C&IS Department (Wheatley Center), Robert Morris University