CoMBCR

Introduction

CoMBCR is a B-cell embedding method that integrates multi-modal B-cell data, specifically B-cell receptors (BCRs) and gene expression, within a co-learning framework. Given paired BCR sequences and gene expression profiles as input, it produces a joint representation for each B cell. It focuses on the heavy chain of each BCR.

Prerequisites

CoMBCR is implemented in Python and requires a GPU for acceleration.

We recommend the following package versions:

PyTorch (2.4.1)
Transformers (4.41.2)
NumPy (1.26.4)
Pandas (2.2.3)
Scikit-learn (1.5.1)
huggingface_hub (install with python3 -m pip install huggingface_hub)

Please install the following packages if you want to use the visualization functions:

anndata (0.9.2)
scanpy (1.9.8)
matplotlib (3.6.3)
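
The recommended versions can be compared against the local environment before training. The following is a minimal sketch that uses only the standard library. The helper names check_versions and gpu_available are illustrative and not part of CoMBCR, and the distribution names in RECOMMENDED (for example "torch" for PyTorch) are assumptions about how the packages were installed.

```python
from importlib import metadata

# Recommended versions copied from the list above.
# Keys are assumed PyPI distribution names (e.g. "torch" for PyTorch).
RECOMMENDED = {
    "torch": "2.4.1",
    "transformers": "4.41.2",
    "numpy": "1.26.4",
    "pandas": "2.2.3",
    "scikit-learn": "1.5.1",
}


def check_versions(recommended=RECOMMENDED):
    """Return {package: (installed_or_None, recommended)} for every mismatch."""
    mismatches = {}
    for name, wanted in recommended.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore local build tags such as "+cu121" when comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (found, wanted)
    return mismatches


def gpu_available():
    """Return True if PyTorch can be imported and sees a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


# An empty dict means every package matches the recommended version.
print(check_versions())
```

A mismatch is not necessarily an error, since the versions above are recommendations. Calling gpu_available() additionally reports whether PyTorch detects a CUDA device.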

Installation

Install CoMBCR using pip:

pip3 install CoMBCR

Then download the default pre-trained encoder. This code only needs to be run once, when you install CoMBCR:

from CoMBCR.utils import download_BCRencoder
download_BCRencoder()

Tutorial

We provide a tutorial (tutorial.ipynb) on using CoMBCR. The usage section below applies to the current version of CoMBCR.

Please refer to tutorial_pair.ipynb if you want to use paired chains. Note that paired chains require about twice the computational resources and, in our tests so far, do not significantly improve performance.

Usage

Prepare input data

CoMBCR integrates BCRs and gene expression, but it requires three input files: a BCR sequences file, a gene expression file, and a file of BCR embeddings generated by a BCR encoder (e.g., AntiBERTa, ESM2).

Ensure each file includes an index column labeled "barcode," serving as a unique identifier for each cell.

Verify that the cells are aligned in the same order across all three files.
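
The alignment requirement can be checked before training. The sketch below assumes each file is a CSV whose header contains a column literally named "barcode". The function names read_barcodes and check_alignment are hypothetical helpers, not part of CoMBCR.

```python
import csv


def read_barcodes(path, column="barcode"):
    """Return the barcode column of a CSV file, in file order."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        if column not in (reader.fieldnames or []):
            raise ValueError(f"{path} has no '{column}' column")
        return [row[column] for row in reader]


def check_alignment(paths, column="barcode"):
    """Raise ValueError unless all files list the same unique barcodes in the same order.

    Returns the number of cells when the check passes.
    """
    reference = read_barcodes(paths[0], column)
    if len(set(reference)) != len(reference):
        raise ValueError(f"{paths[0]} contains duplicate barcodes")
    for path in paths[1:]:
        if read_barcodes(path, column) != reference:
            raise ValueError(f"{path} is not aligned with {paths[0]}")
    return len(reference)
```

For the example data this would be called as check_alignment(["exampledata/example_bcr.csv", "exampledata/example_rna.csv", "exampledata/example_bcrori.csv"]).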

BCR sequences file

This CSV file should include an index column named "barcode" and columns labeled "fwr1", "cdr1", "fwr2", "cdr2", "fwr3", "cdr3" and "fwr4". The file should resemble exampledata/example_bcr.csv in this repository.
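
The column requirement can also be checked programmatically. This is a minimal sketch that reads only the CSV header. The name missing_bcr_columns is a hypothetical helper, and the sketch assumes the column names appear in the header exactly as listed above.

```python
import csv

# Column names required by the BCR sequences file, as listed above.
REQUIRED_COLUMNS = ["barcode", "fwr1", "cdr1", "fwr2", "cdr2", "fwr3", "cdr3", "fwr4"]


def missing_bcr_columns(path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]
```

An empty list means the header contains every required column. The check does not validate the sequences themselves.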

Gene expression file

Normalization and log-transformation are recommended. Batch effect removal is advisable if applicable. We suggest using the top 5,000 highly variable genes, though you can select input genes according to your criteria.
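
The recommended preprocessing is usually done with Scanpy (sc.pp.normalize_total, sc.pp.log1p and sc.pp.highly_variable_genes). As an illustration of what these steps do, the following is a simplified NumPy stand-in. It assumes a matrix with cells as rows and genes as columns, uses a target_sum of 1e4 (an assumption, not a CoMBCR setting), and ranks genes by plain variance, which is simpler than Scanpy's highly-variable-gene selection. Batch effect removal is not covered.

```python
import numpy as np


def normalize_log_select(counts, n_top=5000, target_sum=1e4):
    """Library-size normalize, log1p-transform, and keep the n_top most variable genes.

    counts: array of shape (cells, genes) with raw counts.
    Returns (matrix of shape (cells, k), indices of the kept genes in original order).
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    logged = np.log1p(counts / totals * target_sum)

    # Rank genes by variance across cells (a simplification of HVG selection).
    variances = logged.var(axis=0)
    k = min(n_top, logged.shape[1])
    keep = np.sort(np.argsort(variances)[::-1][:k])
    return logged[:, keep], keep
```

The number of columns in the returned matrix is the value to pass as encoderprofile_in_dim when it differs from 5000.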

Original BCR embeddings file

Please clone this GitHub repository (deepomicslab/CoMBCR) or download runberta.py from it. This script generates the original BCR embeddings, which CoMBCR needs to compute pairwise BCR distances. We recommend our default pre-trained encoder, although any encoder can be used to encode BCRs.

python3 runberta.py --datapath "exampledata/example_bcr.csv" --outdir "example_outdir" --outfilename "antiberta_embedding.csv"

This command writes the original BCR embeddings to a file named "antiberta_embedding.csv" in the output directory.
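
To make "pairwise BCR distances" concrete, the sketch below computes a distance matrix from an embedding matrix with one row per cell. The distance metric CoMBCR uses internally is not stated in this README, so Euclidean distance is an assumption chosen purely for illustration, and pairwise_euclidean is a hypothetical helper.

```python
import numpy as np


def pairwise_euclidean(embeddings):
    """Return the (n, n) matrix of Euclidean distances between the rows of embeddings."""
    x = np.asarray(embeddings, dtype=float)
    squared_norms = (x ** 2).sum(axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once.
    d2 = squared_norms[:, None] + squared_norms[None, :] - 2.0 * (x @ x.T)
    np.maximum(d2, 0.0, out=d2)  # clip tiny negatives caused by rounding
    return np.sqrt(d2)
```

The result is symmetric with a zero diagonal. For n cells it needs memory proportional to n squared, which is worth keeping in mind for large datasets.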

Quick run

To quickly run CoMBCR, use the following code:

from CoMBCR.CoMBCR import CoMBCR_main

bcremb, gexemb = CoMBCR_main(
    bcrpath="exampledata/example_bcr.csv",         # BCR sequences file
    rnapath="exampledata/example_rna.csv",         # gene expression file
    bcroriginal="exampledata/example_bcrori.csv",  # original BCR embeddings file
    outdir="example_outdir",
    epochs=1,  # Adjust the number of epochs here; the default is 200.
    batch_size=32,
    encoderprofile_in_dim=5000,
)

This code returns NumPy arrays for the BCR embeddings and the gene expression embeddings, and writes "bcrembedding.csv" and "gexembedding.csv" to the specified output directory.
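
The value of encoderprofile_in_dim has to match the number of input genes. One way to avoid hard-coding it is to count the gene columns in the gene expression file. The sketch below assumes that file is a CSV with cells as rows, a "barcode" index column, and no other non-gene columns. The name count_input_genes is a hypothetical helper, not part of CoMBCR.

```python
import csv


def count_input_genes(rnapath, index_column="barcode"):
    """Count gene columns in the expression CSV header.

    Every named header cell except the index column is treated as a gene.
    Empty header cells (e.g. an unnamed index) are ignored.
    """
    with open(rnapath, newline="") as handle:
        header = next(csv.reader(handle), [])
    names = [name.strip() for name in header]
    return sum(1 for name in names if name and name != index_column)


# Intended use (not executed here because it needs the example data):
# n_genes = count_input_genes("exampledata/example_rna.csv")
# then pass encoderprofile_in_dim=n_genes to CoMBCR_main.
```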

Parameters of CoMBCR

Parameter	Description
bcrpath	(Required) The path to the BCR sequences file.
rnapath	(Required) The path to the gene expression file.
bcroriginal	(Required) The path to the original BCR embeddings file.
outdir	(Required) The directory where the best checkpoint file and the output embeddings will be stored.
checkpoint	Default is "best_network.pth". The name of the saved checkpoint.
lr	Default is 1e-6. The learning rate.
lam	Default is 1e-1. The weight of the intra-modal contrastive loss (α in the paper).
batch_size	Default is 256.
epochs	Default is 200.
patience	Default is 15. The patience for early stopping.
save_epoch	Default is None. If specified (e.g., 150), the model is saved at that epoch and training exits. By default, the early stopping strategy is used.
lr_step	Default is [50,100]. The milestones for the MultiStepLR scheduler, which adjusts the learning rate at the specified epochs.
encoderprofile_in_dim	Default is 5000. Adjust this parameter if the number of input genes differs from 5000.
separatebatch	Default is False. If set to True, BCRs from different samples are treated as distinct BCRs. Ensure that your BCR input file contains a "sample" column if you enable this option.
user_defined_cluster	Default is False. If set to True, the model uses the custom cluster labels in the "cluster_label" column of the BCR input file for intra-modal contrastive learning.
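
The interaction of lr, lr_step and patience can be illustrated without running a training job. The sketch below reproduces the MultiStepLR rule (the learning rate is multiplied by gamma at each milestone) and a generic early-stopping counter. Two assumptions are made: gamma is taken to be 0.1, which is the PyTorch default for MultiStepLR and may differ from what CoMBCR uses, and the early-stopping rule shown (stop after patience epochs without a lower loss) is a common convention rather than CoMBCR's confirmed criterion. Both names are hypothetical helpers.

```python
from bisect import bisect_right


def multistep_lr(epoch, base_lr=1e-6, milestones=(50, 100), gamma=0.1):
    """Learning rate at a given epoch under a MultiStepLR-style schedule.

    The rate is multiplied by gamma once for every milestone <= epoch.
    gamma=0.1 is an assumption (the PyTorch default).
    """
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


class EarlyStopper:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's loss and return True when training should stop."""
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# With the defaults, the rate drops tenfold at epoch 50 and again at epoch 100.
for epoch in (0, 50, 100):
    print(epoch, multistep_lr(epoch))
```

Under these assumptions, a run with the default settings trains at 1e-6 until epoch 50 and ends as soon as the monitored loss has failed to improve for 15 consecutive epochs, or at epoch 200 at the latest.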

Visualization

We provide functions to interpret the optimization performance and visualize the output embeddings.

1. Optimization performance

Use plot_training_loss to visualize the optimization process. This function plots three key loss components:

Cross-Modal Loss (L_cross): Measures cross-modal alignment. A decrease indicates the model is learning the correspondence between BCR and GEX modalities.
Profile Loss (L_p): Measures preservation of GEX intrinsic structure. A decrease indicates biological variation is being retained.
BCR Loss (L_b): Measures preservation of BCR intrinsic structure. A decrease indicates clonal relationships are being maintained.

Example:

from CoMBCR.visualization import plot_training_loss

# Visualize training progress
# Mode: 'earlystopping' (default) or 'save_epoch'
fig = plot_training_loss(
    log_path='example_outdir/CoMBCR.pth.log', # Path to the log file
    mode='earlystopping',
    save_path='training_loss.png'  # Optional: save figure
)

Key Parameters

log_path (required): Path to the training log file (e.g., 'output/CoMBCR.pth.log')
mode (default: 'earlystopping'): Set to 'save_epoch' if you designated a specific epoch to save
save_path (default: None): Path to save the output figure. If None, the figure is displayed but not saved

Output Figure

2. Joint Embedding Visualization

Use create_joint_embedding_adata to process the output embeddings into a Scanpy-compatible AnnData object for downstream analysis.

Example:

from CoMBCR.visualization import create_joint_embedding_adata
import scanpy as sc

# 1. Create AnnData with joint embeddings
adata = create_joint_embedding_adata(
    bcr_emb_path="example_outdir/Embeddings/bcr_embeddings.csv",
    gex_emb_path="example_outdir/Embeddings/gex_embeddings.csv",
    metadata="example_outdir/annotation.csv",  # Optional
)

# 2. Visualize using standard Scanpy workflow
sc.pl.umap(adata, color='celltypes', title='CoMBCR Joint Embedding')

Key Parameters

bcr_emb_path (str): Path to the generated BCR embedding CSV file.
gex_emb_path (str): Path to the generated GEX embedding CSV file.
metadata (str or pd.DataFrame): Optional. Path to your annotation file (or a DataFrame). The index should match the barcodes.

Output

The returned adata object is structured for flexible analysis:

adata.X: Stores the Joint Embeddings. Use this for clustering and global visualization.
adata.obsm['CoMBCR_bcr']: Stores the CoMBCR-BCR Embeddings.
adata.obsm['CoMBCR_gex']: Stores the CoMBCR-GEX Embeddings.

Output Figure

Visualize CoMBCR's Individual Modalities

Sometimes, users may want to inspect CoMBCR-BCR embeddings or CoMBCR-GEX embeddings. To visualize these embeddings separately, use create_sub_embedding_adatas:

from CoMBCR.visualization import create_sub_embedding_adatas
import scanpy as sc

# Extract separate embeddings from joint AnnData
bcr_adata, gex_adata = create_sub_embedding_adatas(
    adata,
    compute_umap=True
)

# Visualize BCR embedding
sc.pl.umap(bcr_adata, color='v_genes', title='BCR Embedding')

# Visualize GEX embedding
sc.pl.umap(gex_adata, color='celltypes', title='GEX Embedding')

Questions

If you encounter issues installing or using CoMBCR, please feel free to open an issue or contact me via email.
