ddSeq
In this tutorial we'll show how to use STARsolo to process ddSeq scRNA-seq data and compare the results to those generated by ddSeeker, another tool for processing ddSeq data. We'll be using a dataset provided by Illumina which contains a mixture of human and mouse cells and was used in the ddSeeker paper.
The dataset used in this tutorial must be obtained through Illumina BaseSpace. Through BaseSpace, download the fastqs for Sample A of the "NextSeq SureCell WTA 3 Mixed Species HEK 3T3 1400 cell" dataset. As in the ddSeeker paper, we'll only be using Sample A of the dataset due to the long running time of ddSeeker.
Note that this dataset contains both mouse and human cells and requires a mixed-species reference. If you don't have one handy, an easy way to create one from individual human and mouse references is to use Cell Ranger's mkref command, which looks like:
cellranger mkref --genome=hg19 --fasta=/path/to/human_genome.fa --genes=/path/to/human_genes.gtf --genome=mm10 --fasta=/path/to/mouse_genome.fa --genes=/path/to/mouse_genes.gtf
cellranger mkref also builds a STAR genome index as part of the combined reference it generates. If you already have combined fasta and gtf files, you can generate the genome index directly with STAR as follows:
STAR --runThreadN 20 --runMode genomeGenerate --genomeDir /path/to/index/output --genomeFastaFiles genome.fa --sjdbGTFfile genes.gtf
The other thing we'll need is a ddSeq whitelist, which can be found in this repository here.
We'll assume you have STAR (>=2.7.5c) installed and the appropriate reference genome index described above.
You can use STARsolo to process ddSeq data by simply passing the following additional parameters which tell STARsolo how to parse the barcodes:
--soloCBposition 2_-6_2_-1 3_1_3_6 3_22_3_27
--soloUMIposition 3_31_3_39
--soloAdapterSequence TAGCCATCGCATTGC
--soloCBmatchWLtype 1MM
Here is an example of the full command. Despite the large number of parameters it is straightforward: just replace the paths to the fastq files, genome index, and whitelist with those on your system:
STAR --readFilesCommand zcat --soloType CB_UMI_Complex --soloCBwhitelist [ddSeq.whitelist] \
    --soloFeatures Gene --runThreadN 1 --outSAMtype None --genomeDir [path/to/index] \
    --readFilesIn [/path/to/sampleA_R2.fastq] [/path/to/sampleA_R1.fastq] \
    --soloCBposition 2_-6_2_-1 3_1_3_6 3_22_3_27 --soloUMIposition 3_31_3_39 \
    --soloAdapterSequence TAGCCATCGCATTGC --soloCBmatchWLtype 1MM \
    --genomeLoad LoadAndKeep --readMapNumber -1
This will output the usual count matrices, which can be processed with downstream libraries as covered below.
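To make the barcode-position parameters more concrete: per the STAR manual, each --soloCBposition block has the form startAnchor_startPosition_endAnchor_endPosition, where anchor 2 is the first base of the adapter and anchor 3 is the last base of the adapter. The Python sketch below is purely illustrative (it is not how STARsolo is implemented); it uses a synthetic read assembled from labeled parts just to show which bases the values above select:

import re

# Illustrative only: a synthetic barcode read (R1) assembled from labeled parts,
# following the layout implied by the --soloCBposition/--soloUMIposition values above.
adapter = "TAGCCATCGCATTGC"      # --soloAdapterSequence
read = "AACGTT" + adapter + "CCATGG" + "N" * 15 + "GGTACC" + "NNN" + "ACGTACGTA" + "GAC"
#       BC1      adapter    BC2        linker     BC3        spacer   UMI           tail

a = read.find(adapter)           # anchor 2: first base of the adapter
e = a + len(adapter) - 1         # anchor 3: last base of the adapter

bc1 = read[a - 6 : a]            # 2_-6_2_-1 -> the 6 bases immediately before the adapter
bc2 = read[e + 1 : e + 7]        # 3_1_3_6   -> bases 1-6 after the adapter
bc3 = read[e + 22 : e + 28]      # 3_22_3_27 -> bases 22-27 after the adapter
umi = read[e + 31 : e + 40]      # 3_31_3_39 -> bases 31-39 after the adapter
print(bc1, bc2, bc3, umi)        # AACGTT CCATGG GGTACC ACGTACGTA

STARsolo concatenates the three blocks into the cell barcode; in its output they appear separated by underscores, which is why we strip the underscores later when matching against the ddSeeker barcodes.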
You can skip this section if you're just interested in running STARsolo, but for completeness we'll cover running ddSeeker here; there are a few details that aren't spelled out in the ddSeeker documentation and matter for reproducibility. First, obtain ddSeeker from its GitHub repository and install the required Python libraries. As the authors suggest, make sure you specifically use pip to install the dependencies. We'll assume you have Python>=3.4 installed.
git clone https://github.com/cgplab/ddSeeker.git
pip install biopython
pip install pysam
You will also need the dropseq-tools library in order to generate the count matrices. Note that you need to download an older version of the library that is compatible with their script, or change some of the function calls to their updated equivalents. The version used for the data in this tutorial is v1.13, which can be downloaded here.
Before running ddSeeker, we also need to generate a sequence dictionary for the reference, placed in the same directory as your reference fasta (the one provided to the reference option). This can be created using Picard (which is included as part of dropseq-tools) with the following command:
java -jar /path/to/Drop-seq_tools-1.13/3rdParty/picard/picard.jar CreateSequenceDictionary R=genome.fa O=genome.dict
Finally, we're going to make a small modification to the script to simplify things and allow comparison with the STARsolo output. By default the script expects a refFlat annotation file to pass to the dropseq-tools TagReadWithGeneExon function. However, that function also accepts a GTF without modification, so we'll change the script to pass the GTF instead. Change line 120, which points to a refFlat file, to the path of the GTF file you ran STARsolo with.
Old line:
refflat=${reference%.fa*}.refFlat
New line:
refflat=/path/to/genes.gtf
We're now ready to run ddSeeker using their provided scripts. Make sure you provide the correct paths on your system to all of the executables and files required by ddSeeker:
-d <dropseq_root> : Directory containing Drop-seq executables. Required.
-g <genomedir> : Directory of STAR genome directory. Required.
-r <referencefasta> : Reference fasta of the Drop-seq reference metadata bundle. Required.
-s <STAR_path> : Full path of STAR. Default: 'STAR'
-k <ddSeeker_path> : Full path of ddSeeker. Default: 'ddSeeker.py'
It is also highly recommended to increase the number of cores above the default of 1 using -n. Be warned: even with 20 cores on a high-performance server the runtime is ~7 hrs. With everything in order, you should now be able to run ddSeeker (and the dropseq-tools tagging) using the script provided in their GitHub repo (replace [options] with your paths etc.):
ddSeeker/code/ddSeeker_dropSeq_tools.sh [options] /path/to/sampleA_R1.fastq.gz /path/to/sampleA_R2.fastq.gz
But wait, we still don't have a count matrix. This can be generated with dropseq-tools using the DigitalExpression function. Here we output the top 350 cells, as in the ddSeeker paper. The tagged bam file can be found in the output directory provided to the previous script.
Drop-seq_tools-1.13/DigitalExpression I=star_gene_exon_tagged.bam O=star_gene_exon_tagged.dge.txt.gz SUMMARY=star_gene_exon_tagged.dge.summary.txt NUM_CORE_BARCODES=350
We'll be using Pandas and ScanPy to compare the results from STARsolo and ddSeeker both quantitatively and qualitatively. Load the appropriate libraries:
import pandas as pd
import scanpy as sc
import numpy as np
First we'll load and process the counts with ScanPy for both methods. For ddSeeker, the top 350 cells were output as in the paper.
dds_adata = sc.read_text('/path/to/ddSeeker/counts/star_gene_exon_tagged.dge.txt',
first_column_names=True)
dds_adata = dds_adata.T
Next, read the STARsolo counts. To read them with ScanPy's read_10x_mtx function, we first need to rename the gene list in the STARsolo output folder: in {prefix}Solo.out/Gene/filtered/, rename features.tsv to genes.tsv.
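For example, a minimal sketch of doing the rename from Python (assuming the default uncompressed STARsolo output; adjust the path to your own output prefix):

import os

solo_dir = '{prefix}Solo.out/Gene/filtered/'
features_path = os.path.join(solo_dir, 'features.tsv')
# read_10x_mtx expects the legacy file name genes.tsv for uncompressed output
if os.path.exists(features_path):
    os.rename(features_path, os.path.join(solo_dir, 'genes.tsv'))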
solo_adata = sc.read_10x_mtx('{prefix}Solo.out/Gene/filtered/')
Then we need to filter/align the genes and barcodes so they match between the two methods. Note that the barcodes output from STARsolo need a little processing to align with those from ddSeeker, since STARsolo outputs the three barcode blocks separated by underscores.
# Align the gene sets
solo_adata = solo_adata[:, dds_adata.var_names]
# Remove the underscores from STARsolo barcodes so they can be matched with those from ddSeeker
solo_adata.obs_names = solo_adata.obs_names.str.replace('_', '')
# Find the common barcodes present in the STARsolo and ddSeeker output
common_cells = dds_adata.obs_names.intersection(solo_adata.obs_names)
# Filter the STARsolo and ddSeeker data to the common barcodes
solo_adata = solo_adata[common_cells]
dds_adata = dds_adata[common_cells]
After alignment, the ddSeeker and STARsolo outputs differ by two cells, leaving a total of 348 cells in common.
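As a quick sanity check (the exact numbers will depend on your run):

# Both objects should now contain the same 348 common barcodes
print(solo_adata.n_obs, dds_adata.n_obs)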
First, we'll look at a quantitative analysis of how the counts correlate between the two methods on a count-by-count basis. Then, we'll look at how the results compare qualitatively by comparing the clustering and differential expression from ScanPy. We expect some small differences in the results due to differences in the algorithms that produce the gene counts, but they should be qualitatively similar.
To compare the counts, we're going to convert the count matrices from the AnnData objects into pure dataframes and stack them for easier processing.
counts_ss_df = solo_adata.to_df().stack()
counts_dd_df = dds_adata.to_df().stack()
To start off, we'll look at the raw correlations between the counts for each cell/gene pair:
from scipy.stats import pearsonr
all_df = pd.concat([counts_ss_df, counts_dd_df], axis=1)
all_df.columns = ["STARsolo", "ddSeeker"]
all_df.corr()
all_df.corr(method='spearman')
all_df.transform(lambda x: np.log2(x+1)).corr()
From this we see that the Pearson correlation of the raw counts is 0.921, the Spearman correlation is 0.971, and the Pearson correlation of the log2-transformed counts (most relevant for analysis) is 0.975. These numbers are within the range expected for this kind of comparison.
As a more detailed analysis, we'll look at the distribution of the magnitudes of the differences between individual counts. To do this we'll plot the percentage of counts with a given magnitude of difference between the methods. We'll also only include counts where at least one method reports 10 or more, to make the comparison more meaningful. We'll be using plotly for plotting.
import plotly.express as px
import plotly.figure_factory as ff

count_threshold = 10
df1 = counts_dd_df.astype(np.float32)
df2 = counts_ss_df.astype(np.float32)
# Mask out low counts
mask_vals = (df1 < count_threshold) & (df2 < count_threshold)
df1.mask(mask_vals, inplace=True)
df2.mask(mask_vals, inplace=True)
df1.replace(0.0, np.nan, inplace=True)
df2.replace(0.0, np.nan, inplace=True)
# Compute the log2 fold differences
ratios = np.abs(np.log2(df1 / df2)).dropna()
# Make the plot with a bin size of 0.2
fig = ff.create_distplot([ratios.to_numpy()], ['Ratio counts'], show_rug=False,
show_curve=False, bin_size=0.2, histnorm='probability')
fig.update_xaxes(title_text='Fold Difference (log2)')
fig.update_yaxes(title_text='Percent of counts')
fig.update_layout(title_text='Percentage of counts (above 10 counts) with a certain fold difference')
fig.show('svg')
We can see that most counts have a small difference, and there are few counts with extreme differences.
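To put a rough number on that claim, here is a quick check on the ratios computed above (the exact values will depend on your run):

# Fraction of the thresholded counts within 0.5 and within 1 log2-fold of each other
print((ratios < 0.5).mean(), (ratios < 1.0).mean())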
Secondly, we'll see how the counts correlate as a function of the size of the counts by computing the correlation only considering counts above a certain threshold.
threshold_max = 100
r_values = []
for threshold_val in range(0, threshold_max, 2):
    # Mask out values less than threshold_val from being considered in the correlation
    df1_filtered = counts_dd_df.mask(counts_dd_df < threshold_val)
    df2_filtered = counts_ss_df.mask(counts_ss_df < threshold_val)
    # Log-transform the remaining counts before correlating
    df1_filtered = df1_filtered.transform(lambda x: np.log2(x+1))
    df2_filtered = df2_filtered.transform(lambda x: np.log2(x+1))
    corr = df1_filtered.corr(df2_filtered)
    r_values.append(corr)
r_df = pd.DataFrame(zip(r_values, range(0, threshold_max, 2)), columns=["r", "Threshold"])
px.line(r_df, x='Threshold', y='r').update_layout(title_text='Pearson correlation when only considering counts above a certain threshold').show('svg')
Now we'll look at how the downstream results compare by running both counts through a standard ScanPy pipeline and comparing the clustering and DE.
Do filtering and compare some QC metrics:
sc.pp.filter_cells(solo_adata, min_genes=200)
sc.pp.filter_genes(solo_adata, min_cells=3)
sc.pp.filter_cells(dds_adata, min_genes=200)
sc.pp.filter_genes(dds_adata, min_cells=3)
sc.pp.calculate_qc_metrics(solo_adata, percent_top=None, log1p=False, inplace=True)
sc.pp.calculate_qc_metrics(dds_adata, percent_top=None, log1p=False, inplace=True)
# STARsolo
sc.pl.violin(solo_adata, ['n_genes_by_counts', 'total_counts'],
jitter=0.4, multi_panel=True)
# ddSeeker
sc.pl.violin(dds_adata, ['n_genes_by_counts', 'total_counts'],
jitter=0.4, multi_panel=True)
Run a standard preprocessing pipeline and compare UMAP/clustering:
def apply_preprocessing(adata):
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
    adata.raw = adata
    adata = adata[:, adata.var.highly_variable]
    sc.pp.scale(adata, max_value=10)
    sc.tl.pca(adata, svd_solver='arpack')
    sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
    sc.tl.umap(adata)
    sc.tl.leiden(adata)
    sc.tl.rank_genes_groups(adata, 'leiden')
    return adata

solo_adata = apply_preprocessing(solo_adata)
dds_adata = apply_preprocessing(dds_adata)
Compare UMAP/clustering:
# STARsolo
sc.pl.umap(solo_adata, color='leiden')
# ddSeeker
sc.pl.umap(dds_adata, color='leiden')
Finally we'll look at a differential expression heatmap between the clusters to see how the gene expression in each cluster compares. Although there are some differences in the finer subclustering, both methods are similar and successfully deconvolve the mouse and human cells, which correspond to the two large superclusters in the UMAP/blocks in the heatmap.
# STARsolo
sc.pl.rank_genes_groups_heatmap(solo_adata, n_genes=3)
# ddSeeker
sc.pl.rank_genes_groups_heatmap(dds_adata, n_genes=3)
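As an optional follow-up, one way to check the human/mouse split directly is to colour the UMAP by the fraction of each cell's expression assigned to human genes. The sketch below assumes the combined reference prefixes gene names with the genome name (e.g. hg19_ and mm10_, as cellranger mkref does for multi-genome references); check your genes.tsv file and adjust the prefixes if yours differ.

# Hedged sketch: per-cell fraction of (log-normalized) expression coming from genes
# with an assumed human prefix vs an assumed mouse prefix.
def add_human_fraction(adata, human_prefix='hg19_', mouse_prefix='mm10_'):
    raw = adata.raw.to_adata()  # all genes; adata itself was subset to highly variable genes
    human = np.asarray(raw[:, raw.var_names.str.startswith(human_prefix)].X.sum(axis=1)).ravel()
    mouse = np.asarray(raw[:, raw.var_names.str.startswith(mouse_prefix)].X.sum(axis=1)).ravel()
    adata.obs['human_fraction'] = human / (human + mouse)

add_human_fraction(solo_adata)
add_human_fraction(dds_adata)
sc.pl.umap(solo_adata, color='human_fraction')
sc.pl.umap(dds_adata, color='human_fraction')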