Calculation of codon usage#

This script creates a table for codon usage in the Escherichia coli K-12 strain. Codon usage describes the frequencies of the codons that code for an amino acid. These frequencies are expected to reflect the abundance of the respective tRNAs in the target organism. The codon usage differs from species to species. Hence, it is important to look at the codon usage of the organism of interest for applications like codon optimization etc.

For the computation of the codon usage we will have a look into the annotated Escherichia coli K-12 genome: The script extracts all coding sequences (CDS) from the genome and counts the total number of each codon in these sequences. Then the relative frequencies are calculated by dividing the total number of occurrences of each codon by the total number of occurrences of the respective amino acid. In order to improve the performance, the script mostly works with symbol codes (see tutorial) instead of the symbols itself.

At first we fetch the genome from the NCBI Entrez database (Accession: U00096) as GenBank file and parse it into an AnnotatedSequence. Then we create a dictionary that will store the total codon frequencies later on. As already mentioned, the script works with symbol codes. Consequently, each codon in the dictionary is described as 3 integers instead of 3 letters.

# Code source: Patrick Kunzmann
# License: BSD 3 clause

import itertools
import tempfile
import numpy as np
import biotite.database.entrez as entrez
import biotite.sequence as seq
import biotite.sequence.io.fasta as fasta
import biotite.sequence.io.genbank as gb

# Get the E. coli K-12 genome as annotated sequence
gb_file = gb.GenBankFile.read(
    entrez.fetch("U00096", tempfile.gettempdir(), "gb", "nuccore", "gb")
)
# We are only interested in CDS features
k12_genome = gb.get_annotated_sequence(gb_file, include_only=["CDS"])


# This dictionary will count how often each codon occurs in the genome
# For increased performance the dictionary uses symbol codes ([0 3 2])
# instead of symbols (['A' 'T' 'G']) as keys
codon_counter = {
    codon: 0
    for codon in itertools.product(*([range(len(k12_genome.sequence.alphabet))] * 3))
}
# For demonstration purposes print the 64 codons in symbol code form
print(list(codon_counter.keys()))
[(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 0, 3), (0, 1, 0), (0, 1, 1), (0, 1, 2), (0, 1, 3), (0, 2, 0), (0, 2, 1), (0, 2, 2), (0, 2, 3), (0, 3, 0), (0, 3, 1), (0, 3, 2), (0, 3, 3), (1, 0, 0), (1, 0, 1), (1, 0, 2), (1, 0, 3), (1, 1, 0), (1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 2, 0), (1, 2, 1), (1, 2, 2), (1, 2, 3), (1, 3, 0), (1, 3, 1), (1, 3, 2), (1, 3, 3), (2, 0, 0), (2, 0, 1), (2, 0, 2), (2, 0, 3), (2, 1, 0), (2, 1, 1), (2, 1, 2), (2, 1, 3), (2, 2, 0), (2, 2, 1), (2, 2, 2), (2, 2, 3), (2, 3, 0), (2, 3, 1), (2, 3, 2), (2, 3, 3), (3, 0, 0), (3, 0, 1), (3, 0, 2), (3, 0, 3), (3, 1, 0), (3, 1, 1), (3, 1, 2), (3, 1, 3), (3, 2, 0), (3, 2, 1), (3, 2, 2), (3, 2, 3), (3, 3, 0), (3, 3, 1), (3, 3, 2), (3, 3, 3)]

As expected the dictionary encodes each codon as tuple of 3 numbers, where 0 represents A, 1 C, 2 G and 3 T. These mappings are defined by the alphabet of the genomic sequence.

In the next step the occurrences of codons in the coding sequences are counted, the relative frequencies are calculated and the codon table is printed.

# Iterate over all CDS features
for cds in k12_genome.annotation:
    # Obtain the sequence of the CDS
    cds_seq = k12_genome[cds]
    if len(cds_seq) % 3 != 0:
        # A CDS' length should be a multiple of 3,
        # otherwise the CDS is malformed
        continue
    # Iterate over the sequence in non-overlapping frames of 3
    # and count the occurence of each codon
    for i in range(0, len(cds_seq), 3):
        codon_code = tuple(cds_seq.code[i : i + 3])
        codon_counter[codon_code] += 1

# Convert the total frequencies into relative frequencies
# for each amino acid
# The NCBI codon table with ID 11 is the bacterial codon table
table = seq.CodonTable.load(11)
# As the script uses symbol codes, each amino acid is represented by a
# number between 0 and 19, instead of the single letter code
for amino_acid_code in range(20):
    # Find all codons coding for the amino acid
    # The codons are also in symbol code format, e.g. ATG -> (0, 3, 2)
    codon_codes_for_aa = table[amino_acid_code]
    # Get the total amount of codon occurrences for the amino acid
    total = 0
    for codon_code in codon_codes_for_aa:
        total += codon_counter[codon_code]
    # Replace the total frequencies with relative frequencies
    # and print it
    for codon_code in codon_codes_for_aa:
        # Convert total frequencies into relative frequencies
        codon_counter[codon_code] /= total
        # The rest of this code block prints the codon usage table
        # in human readable format
        amino_acid = seq.ProteinSequence.alphabet.decode(amino_acid_code)
        # Convert the symbol codes for each codon into symbols...
        # ([0,3,2] -> ['A' 'T' 'G'])
        codon = k12_genome.sequence.alphabet.decode_multiple(codon_code)
        # ...and represent as string
        # (['A' 'T' 'G'] -> "ATG")
        codon = "".join(codon)
        freq = codon_counter[codon_code]
        print(f"{amino_acid}   {codon}   {freq:.2f}")
    print()
A   GCA   0.21
A   GCC   0.27
A   GCG   0.36
A   GCT   0.16

C   TGC   0.56
C   TGT   0.44

D   GAC   0.37
D   GAT   0.63

E   GAA   0.69
E   GAG   0.31

F   TTC   0.43
F   TTT   0.57

G   GGA   0.11
G   GGC   0.41
G   GGG   0.15
G   GGT   0.34

H   CAC   0.43
H   CAT   0.57

I   ATA   0.07
I   ATC   0.42
I   ATT   0.51

K   AAA   0.77
K   AAG   0.23

L   CTA   0.04
L   CTC   0.10
L   CTG   0.50
L   CTT   0.10
L   TTA   0.13
L   TTG   0.13

M   ATG   1.00

N   AAC   0.55
N   AAT   0.45

P   CCA   0.19
P   CCC   0.12
P   CCG   0.53
P   CCT   0.16

Q   CAA   0.35
Q   CAG   0.65

R   AGA   0.04
R   AGG   0.02
R   CGA   0.06
R   CGC   0.40
R   CGG   0.10
R   CGT   0.38

S   AGC   0.28
S   AGT   0.15
S   TCA   0.12
S   TCC   0.15
S   TCG   0.15
S   TCT   0.15

T   ACA   0.13
T   ACC   0.44
T   ACG   0.27
T   ACT   0.17

V   GTA   0.15
V   GTC   0.22
V   GTG   0.37
V   GTT   0.26

W   TGG   1.00

Y   TAC   0.43
Y   TAT   0.57

The codon usage table can be used to optimize recombinant protein expression by designing the DNA sequence for the target protein according to the codon usage. This is called codon optimization. However, there is a variety of different algorithms for codon optimization. For simplicity reasons, this example uses an approach, where for every amino acid always the most frequently occuring codon is used.

In the follwing we will derive a codon optimized DNA sequence of streptavidin for expression in E. coli K-12.

# For each amino acid, select the codon with maximum codon usage
# Again, symbol codes are used here
opt_codons = {}
for amino_acid_code in range(20):
    codon_codes_for_aa = table[amino_acid_code]
    # Find codon with maximum frequency
    max_freq = 0
    best_codon_code = None
    for codon_code in codon_codes_for_aa:
        if codon_counter[codon_code] > max_freq:
            max_freq = codon_counter[codon_code]
            best_codon_code = codon_code
    # Map the amino acid to the codon with maximum frequency
    opt_codons[amino_acid_code] = best_codon_code

# Fetch the streptavidin protein sequence from Streptomyces avidinii
fasta_file = fasta.FastaFile.read(
    entrez.fetch("P22629", None, "fasta", "protein", "fasta")
)
strep_prot_seq = fasta.get_sequence(fasta_file)
# Create a DNA sequence from the protein sequence
# using the optimal codons
strep_dna_seq = seq.NucleotideSequence()
strep_dna_seq.code = np.concatenate(
    [opt_codons[amino_acid_code] for amino_acid_code in strep_prot_seq.code]
)
# Add stop codon
strep_dna_seq += seq.NucleotideSequence("TAA")
# Put the DNA sequence into a FASTA file
fasta_file = fasta.FastaFile()
fasta_file["Codon optimized streptavidin"] = str(strep_dna_seq)
# Print the contents of the created FASTA file
print(fasta_file)
# In a real application it would be written onto the hard drive via
# fasta_file.write("some_file.fasta")
>Codon optimized streptavidin
ATGCGCAAAATTGTGGTGGCGGCGATTGCGGTGAGCCTGACCACCGTGAGCATTACCGCGAGCGCGAGCGCGGATCCGAG
CAAAGATAGCAAAGCGCAGGTGAGCGCGGCGGAAGCGGGCATTACCGGCACCTGGTATAACCAGCTGGGCAGCACCTTTA
TTGTGACCGCGGGCGCGGATGGCGCGCTGACCGGCACCTATGAAAGCGCGGTGGGCAACGCGGAAAGCCGCTATGTGCTG
ACCGGCCGCTATGATAGCGCGCCGGCGACCGATGGCAGCGGCACCGCGCTGGGCTGGACCGTGGCGTGGAAAAACAACTA
TCGCAACGCGCATAGCGCGACCACCTGGAGCGGCCAGTATGTGGGCGGCGCGGAAGCGCGCATTAACACCCAGTGGCTGC
TGACCAGCGGCACCACCGAAGCGAACGCGTGGAAAAGCACCCTGGTGGGCCATGATACCTTTACCAAAGTGAAACCGAGC
GCGGCGAGCATTGATGCGGCGAAAAAAGCGGGCGTGAACAACGGCAACCCGCTGGATGCGGTGCAGCAGTAA

Gallery generated by Sphinx-Gallery