Quantifying gene expression from RNA-seq data#

This example demonstrates how Biotite’s alignment functionality can be used to map reads from RNA sequencing to a genome (or more precisely to cDNA sequences). This enables counting the number of transcripts for each gene in the expression data. The raw counts can be used for downstream transcriptomics analyses, which are beyond the scope of Biotite.

Note

The approach shown here covers only the backbone of a read mapper. For an actual program, this example should be extended, e.g. with more precise acceptance criteria for alignments.

# Code source: Patrick Kunzmann
# License: BSD 3 clause

import functools
import gzip
import multiprocessing
from io import StringIO
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import biotite
import biotite.application.sra as sra
import biotite.sequence as seq
import biotite.sequence.align as align
import biotite.sequence.io.fasta as fasta
import biotite.sequence.io.fastq as fastq

# The number of processes for read mapping
N_PROCESS = 2
# This example script is only a demonstration
# -> decrease number of processed reads to decrease its run time
EXCERPT = 100000
# A flat Phred quality score threshold under which a read is ignored
QUALITY_THRESHOLD = 30
# k-mer length for matching reads to genes
K = 15
# Window length of minimizers
WINDOW = 10
# Band width for banded gapped alignment
BAND_WIDTH = 10
# Number of highest expressed genes to display
N_TOP_LIST = 20
# URL of the cDNA for the genome of interest
CDNA_URL = (
    "https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/fasta/"
    "arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz"
)
# SRA UID for the RNA-seq data
READS_UID = "SRR6919890"

Fetching the data#

For the purpose of this example, expression data from the plant Arabidopsis thaliana is analyzed. The sequence reads are downloaded from the NCBI Sequence Read Archive (SRA).

app = sra.FastqDumpApp(READS_UID)
app.start()
app.join()
# Single-ended -> Only one FASTQ
fastq_path = app.get_file_paths()[0]

To quantify the expression from the RNA-seq data, we need a reference to which the reads can be aligned. Since the reads stem from RNA, a fitting reference is the set of transcript sequences (cDNA) of the species the RNA-seq data was recorded from.

We could have generated the cDNA sequences ourselves by reading the gene coordinates from a GFF file (via biotite.sequence.io.gff.GFFFile) and extracting the corresponding sequences from the genome sequence FASTA file (via biotite.sequence.io.fasta.FastaFile). However, to keep this example more focussed, the precomputed cDNA sequences are simply fetched from EnsemblPlants [1].
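
For illustration, a rough sketch of that alternative route could look as follows. Note that this is only a sketch: the file paths are placeholders, the available feature keys and qualifiers depend on the annotation source, and proper cDNA extraction would additionally require splicing the exons of each transcript.

import biotite.sequence as seq
import biotite.sequence.io.fasta as fasta
import biotite.sequence.io.gff as gff

# Placeholder paths for a genomic FASTA file and the matching GFF3 annotation
genome = fasta.get_sequence(fasta.FastaFile.read("genome.fa"))
annotation = gff.get_annotation(gff.GFFFile.read("annotation.gff3"))

gene_sequences = {}
for feature in annotation:
    # Only plain gene features are extracted in this sketch;
    # real cDNA requires assembling the exons of each transcript
    if feature.key != "gene":
        continue
    for loc in feature.locs:
        # Biotite locations are 1-based and inclusive
        subsequence = genome[loc.first - 1 : loc.last]
        if loc.strand == seq.Location.Strand.REVERSE:
            subsequence = subsequence.reverse().complement()
        # The available qualifiers (here 'ID') depend on the GFF source
        gene_sequences[feature.qual.get("ID")] = subsequence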

The following code reads each entry in the cDNA FASTA file and extracts the gene symbols, i.e. the ‘names’ of the genes, and the corresponding cDNA sequences.

def get_gene_symbol(header):
    fields = header.split()
    for field in fields:
        if field.startswith("gene_symbol:"):
            # Get only the actual gene symbol
            # -> remove 'gene_symbol:' prefix
            return field.replace("gene_symbol:", "")
    # No gene symbol for this cDNA (e.g. non-coding)
    return None


response = requests.get(CDNA_URL)
fasta_content = gzip.decompress(response.content).decode("UTF-8")

gene_symbols = []
sequences = []
for header, seq_string in fasta.FastaFile.read_iter(StringIO(fasta_content)):
    symbol = get_gene_symbol(header)
    if symbol:
        # Check if the length is large enough to be used in the
        # k-mer table (see below)
        if len(seq_string) < K + WINDOW - 1:
            print(f"Ignored {symbol}: Cannot be indexed")
            continue
        gene_symbols.append(symbol)
        try:
            # Use the unambiguous alphabet (ACGT) to increase the length
            # of k-mers later on:
            # The k-mer code is restricted to int64, so a larger number
            # of base alphabet codes decreases the *k* that fits into
            # the integer type
            sequences.append(seq.NucleotideSequence(seq_string, ambiguous=False))
        except seq.AlphabetError:
            # For the simplicity of this example just ignore sequences
            # with ambiguous symbols
            # This applies only to a few cDNA sequences
            print(f"Ignored {symbol}: Contains ambiguous symbol")

print("\nExcerpt of genes:\n")
for symbol in gene_symbols[:20]:
    print(symbol)
Ignored AT4G33735: Cannot be indexed
Ignored AT2G08986: Contains ambiguous symbol
Ignored ORC4: Contains ambiguous symbol
Ignored ORC4: Contains ambiguous symbol
Ignored AT5G24575: Cannot be indexed
Ignored REF4: Contains ambiguous symbol
Ignored AT1G33612: Contains ambiguous symbol

Excerpt of genes:

AER
AT4G32100
AT2G43120
AT2G43120
AT1G30814
AT1G30814
AT1G30814
AT1G30814
PUB29
PUM6
PUM6
PUM6
PUM6
PUM6
AT1G78930
AT3G07950
AT4G03450
AT4G03450
AT1G71695
AT1G58983

Aligning reads to the reference#

The final aim is to obtain for each read an alignment to the cDNA sequence (i.e. the gene) it originates from. However, performing a full sequence alignment of each read against every cDNA sequence would be computationally infeasible. Instead, a fast matching step is performed first to select only those cDNA sequences that are likely to give a high-scoring alignment for a read.

Here the general approach from the read mapper Minimap [2] is adopted: The cDNA sequences are first decomposed into k-mers and the minimizers are selected from them [3]. In short, the minimizer is the smallest k-mer in a running window of k-mers. The effect is that the number of k-mers to be matched against later is drastically reduced, while the sensitivity of finding a match with the correct cDNA remains good, as long as read and cDNA are highly similar. This assumption only holds if sequencing is conducted with high fidelity. For fast matching, the minimizers of all cDNA sequences are indexed into a BucketKmerTable, the memory-efficient twin of the KmerTable: While a KmerTable would require 4^k ≈ 1 billion buckets (one for each k-mer), the BucketKmerTable limits the number of buckets, but requires a bit more computation time for its construction and matching.

base_alph = seq.NucleotideSequence.alphabet_unamb
kmer_alph = align.KmerAlphabet(base_alph, K)
min_selector = align.MinimizerSelector(kmer_alph, WINDOW, align.RandomPermutation())

kmer_table = align.BucketKmerTable.from_kmer_selection(
    kmer_alph, *zip(*[min_selector.select(sequence) for sequence in sequences])
)
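
To illustrate the minimizer concept itself, independent of the MinimizerSelector used above, a toy pure-Python version could look like the following sketch. It orders k-mers lexicographically, whereas the code above orders k-mer codes by a random permutation, but the windowing logic is the same.

def toy_minimizers(sequence, k, window):
    # Decompose the sequence into overlapping k-mers
    kmers = [sequence[i : i + k] for i in range(len(sequence) - k + 1)]
    minimizers = set()
    # Slide a window over the k-mers and keep the smallest k-mer of each window
    for start in range(len(kmers) - window + 1):
        in_window = kmers[start : start + window]
        offset = min(range(window), key=lambda i: in_window[i])
        minimizers.add((start + offset, in_window[offset]))
    return sorted(minimizers)


# Positions and values of the selected minimizers for a toy sequence
print(toy_minimizers("ACGTTGCATG", k=3, window=4))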

Now the reads can be matched to the indexed cDNA sequences. For this purpose they need to be processed in the same way: They are decomposed into k-mers and the minimizers are selected. After matches to certain cDNA sequences have been identified, the read is aligned to each of these sequences. As noted before, this script assumes sequencing was performed with high fidelity. Thus, the expected probability of indels is relatively small. This circumstance can be leveraged to decrease the computation time even further: Instead of allowing an arbitrary number of gaps between the read and the cDNA, the number of gaps is restricted [4]. Hence, not the entire alignment space is explored, but only a thin band. The required width of the band can be computed based on the indel probability [5], but for the sake of brevity a flat constant is used here. After all alignments have been collected, simply the highest-scoring one is chosen as the correct one.

def map_read(read_string, kmer_table, gene_sequences, substitution_matrix):
    try:
        read = seq.NucleotideSequence(read_string, ambiguous=False)
    except seq.AlphabetError:
        # There are a few reads that may contain ambiguous symbols
        # For the same reason as explained above, these are ignored
        return

    # Fast matching of minimizers
    matches = kmer_table.match_kmer_selection(*min_selector.select(read))
    if len(matches) == 0:
        # No matching gene found for read
        return
    # The probability that a read matches a gene at two different
    # positions that would give distinct alignments is tiny
    # -> For each gene take only the first matched position
    matched_gene_indices, indices = np.unique(matches[:, 1], return_index=True)
    matched_diagonals = matches[indices, 2] - matches[indices, 0]

    # For each matched gene perform a more thorough gapped alignment
    alignments = [
        (
            gene_i,
            align.align_banded(
                read,
                gene_sequences[gene_i],
                substitution_matrix,
                band=(diagonal - BAND_WIDTH, diagonal + BAND_WIDTH),
                gap_penalty=-10,
                max_number=1,
            )[0],
        )
        for gene_i, diagonal in zip(matched_gene_indices, matched_diagonals)
    ]
    # We assume that the best alignment is the correct one
    gene_index, alignment = max(alignments, key=lambda ali: ali[1].score)
    return gene_index, alignment
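
As mentioned above, BAND_WIDTH is simply a flat constant in this example. As a rough illustration of how it could be chosen instead, the following sketch derives a band width from the read length and an assumed per-base indel probability using simple random-walk statistics; this is only a heuristic, not the exact method from [5].

import math


def estimate_band_width(read_length, indel_prob=0.01, n_std_devs=5):
    # Each insertion or deletion shifts the optimal alignment path by one
    # diagonal, so the total drift behaves roughly like a random walk
    # with standard deviation sqrt(read_length * indel_prob)
    drift_std_dev = math.sqrt(read_length * indel_prob)
    # A band of several standard deviations around the matched diagonal
    # should contain the optimal path with high probability
    return math.ceil(n_std_devs * drift_std_dev)


# Example for a hypothetical 100 bp read and a 1 % indel rate
print(estimate_band_width(100))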

Let’s perform a read mapping for a single read.

substitution_matrix = align.SubstitutionMatrix.std_nucleotide_matrix()

for i, (_, (seq_string, q)) in enumerate(
    fastq.FastqFile.read_iter(fastq_path, offset="Sanger")
):
    # For demonstration only a single clean read is mapped
    if i == 3:
        read_string = seq_string
        break

gene_index, alignment = map_read(
    read_string, kmer_table, sequences, substitution_matrix
)

print(f"Match: {gene_symbols[gene_index]}")
print(alignment)
Match: MYB49
GGTTGATCTAGACCAGAAAATCCAAATGGTCAAAGAACTTGATGGACATGG
GGTTGATCTAGACCAGAAAATCCAAATGGTCAAAGAACTTGATGGACATGT

For the thousands of reads that need to be mapped here, it is reasonable to divide the work across multiple processes. As the BucketKmerTable needs to be copied to each of the spawned processes, a long startup time can be expected. However, given the large number of reads that can then be processed in parallel, it is still worth it.

def read_iter(fastq_path):
    for i, (_, (read_string, quality)) in enumerate(
        fastq.FastqFile.read_iter(fastq_path, offset="Sanger")
    ):
        # For the purpose of this example only a fraction of the reads
        # are processed to save computation time
        if i >= EXCERPT:
            break
        # Very simple filtering of low-quality reads
        if np.mean(quality) < QUALITY_THRESHOLD:
            continue
        yield read_string


with multiprocessing.Pool(processes=N_PROCESS) as p:
    # Use multiprocessing to map reads to genes
    # and remove non-mappable reads (None values) afterwards
    mapping_results = list(
        filter(
            lambda mapping: mapping is not None,
            p.map(
                functools.partial(
                    map_read,
                    kmer_table=kmer_table,
                    gene_sequences=sequences,
                    substitution_matrix=substitution_matrix,
                ),
                read_iter(fastq_path),
            ),
        )
    )

Now the genes are counted: For each read, the count of the gene corresponding to the aligned cDNA is incremented. These counts can be used as input for downstream transcriptomics pipelines. Within the scope of this example, simply the most abundant genes are displayed. The alignment itself is discarded here, but note that it could also be used in downstream analysis.

Read alignments are typically stored in file formats like SAM/BAM [6]. The package pysam provides an interface to these formats. To convert an alignment into a CIGAR string, the function write_alignment_to_cigar() can be used.
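
As a small sketch, the best alignment of the first mapped read could be converted into a CIGAR string as shown below; the additional parameters of write_alignment_to_cigar(), e.g. for clipping behavior, are omitted here, so please refer to the Biotite documentation for the exact signature.

# Take the first mapping result and convert its alignment to a CIGAR string
gene_index, alignment = mapping_results[0]
cigar = align.write_alignment_to_cigar(alignment)
print(f"{gene_symbols[gene_index]}: {cigar}")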

counts = np.zeros(len(sequences), dtype=int)
for gene_index, alignment in mapping_results:
    counts[gene_index] += 1

# Show most expressed genes first
order = np.argsort(counts)[::-1]
ranked_gene_symbols = [gene_symbols[i] for i in order]
ranked_counts = counts[order]
# Put into dataframe for prettier printing
counts = pd.DataFrame(
    {"gene_symbol": ranked_gene_symbols, "count": ranked_counts},
    index=np.arange(1, len(ranked_counts) + 1),
)

# Show Top N
top_counts = counts[:N_TOP_LIST]
top_counts
    gene_symbol  count
1     AT3G19460    735
2     AT5G55040    509
3          MYR2    364
4        RBCS2B    272
5          ROF1    270
6     AT1G17610    251
7     AT2G07795    173
8     AT1G62085    162
9          TSO2    150
10    AT1G09890    146
11    AT1G54310    138
12         atpE    138
13    AT3G11130    133
14    AT5G45410    132
15    AT2G30600    131
16        IQD28    124
17    AT2G01667    123
18    AT3G18250    115
19        BGLU8    108
20    AT3G44180    101


Finally, the top expressed genes are plotted.

figure, ax = plt.subplots(figsize=(8.0, 6.0), constrained_layout=True)
ax.barh(top_counts["gene_symbol"], top_counts["count"], color=biotite.colors["orange"])
ax.invert_yaxis()
ax.set_title(f"Top {N_TOP_LIST} expressed genes", weight="semibold")
ax.set_xlabel("Counts")
plt.show()
[Figure: Top 20 expressed genes]

References#
