CopyKAT-Python is a Python reimplementation of the CopyKAT workflow for inferring large-scale copy number alterations (CNAs) from single-cell RNA-seq data. It reproduces the core CopyKAT strategy while improving scalability, usability, and integration with modern AnnData/Scanpy pipelines.
The original CopyKAT-R package is widely used for distinguishing aneuploid tumor cells from diploid normal cells using scRNA-seq data. Recurring practical limitations include:
- Long runtimes, with reports of >1 hour for ~8,000 cells
- Inability to handle very large datasets (hundreds of thousands to millions of cells) due to hierarchical clustering limits
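One plausible reason for the large-dataset limit is the memory cost of the pairwise distance matrix that standard hierarchical clustering needs. The arithmetic below is a rough sketch under stated assumptions: it counts only the condensed distance matrix, stored as 8-byte floats. The helper name `condensed_distance_bytes` is hypothetical and not part of either package.

```python
def condensed_distance_bytes(n_cells: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to store all n*(n-1)/2 pairwise distances (assumes float64)."""
    return n_cells * (n_cells - 1) // 2 * bytes_per_value


# Compare a typical sample, the Xenium-scale dataset, and a million cells.
for n in (8_000, 170_000, 1_000_000):
    gb = condensed_distance_bytes(n) / 1e9
    print(f"{n:>9,} cells -> {gb:,.1f} GB")
```

Under these assumptions the cost grows quadratically with cell number, reaching on the order of 100 GB for about 170,000 cells.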
Highlights:
- Same core parameters as CopyKAT-R, plus Python-specific conveniences
- Handles datasets from thousands to hundreds of thousands of cells and runs substantially faster than CopyKAT-R
- Annotated CNA heatmaps with per-cell metadata sidebars (cell type, cluster labels, etc.)
- Pre-built Singularity container for reproducible deployment
- Validated across 16 human and mouse 10X datasets and a 170k-cell Xenium whole-transcript dataset
From source:
Installs copykat-py into your current environment
git clone https://github.com/navinlabcode/Copykat_python.git
cd Copykat_python
pip install -e .
From environment.yml with conda:
Creates a fresh conda environment for copykat-py with all required packages
git clone https://github.com/navinlabcode/Copykat_python.git
cd Copykat_python
conda env create -f environment.yml
conda activate copykit_py
## After activation, confirm the commands are available
copykat_matrix --help
copykat_anndata --help
Singularity container (recommended for HPC environments):
wget https://github.com/navinlabcode/Copykat_python/releases/download/v1.0.0/copykat_py.sif
singularity exec copykat_py.sif copykat-py --help
CopyKAT-Python supports two main entry points:
| Entry point | When to use |
|---|---|
| copykat_matrix / copykat-py | Input is a .csv, .tsv, or .mtx matrix file on disk (Linux command line) |
| copykat_anndata() | Input is an already-loaded AnnData object in Python |
CSV or TSV matrix:
copykat_matrix \
--input sample_counts.csv \
--sample-name sample1 \
--genome hg20 \
--n-cores 24 \
--output-dir results/sample1
10X matrix market input:
copykat_matrix \
--input filtered_feature_bc_matrix/matrix.mtx.gz \
--genes filtered_feature_bc_matrix/features.tsv.gz \
--barcodes filtered_feature_bc_matrix/barcodes.tsv.gz \
--sample-name sample1 \
--genome hg20 \
--n-cores 24 \
--output-dir results/sample1
Pass --meta (and optionally --row-split) to produce an annotated heatmap alongside the standard output. See Annotated Heatmap with Metadata.
import pandas as pd
from copykat_py import copykat
counts = pd.read_csv("sample_counts.csv", index_col=0)
result = copykat(
rawmat=counts,
id_type="S",
sam_name="sample1",
genome="hg20",
distance="euclidean",
n_cores=24,
)
print(result["prediction"].head())
rawmat can also be a dict with keys matrix, genes, and barcodes for sparse matrices.
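As a minimal sketch of the dict form, the snippet below builds such an input from a toy sparse matrix. Only the three key names come from the text above. The genes x cells orientation, the use of a SciPy CSC matrix, and plain Python lists for genes and barcodes are assumptions made for illustration, and the gene and barcode values are made up.

```python
import numpy as np
from scipy import sparse

# Toy count matrix; rows are genes and columns are cells (orientation is an assumption).
dense = np.array(
    [
        [0, 2, 0],
        [1, 0, 0],
        [0, 0, 3],
        [4, 0, 1],
    ]
)

rawmat = {
    "matrix": sparse.csc_matrix(dense),          # sparse storage, assumed accepted
    "genes": ["TP53", "MYC", "EGFR", "GAPDH"],   # one name per matrix row
    "barcodes": ["AAAC-1", "AAAG-1", "AAAT-1"],  # one barcode per matrix column
}

# Sanity-check that the labels match the matrix dimensions before running anything.
n_genes, n_cells = rawmat["matrix"].shape
assert n_genes == len(rawmat["genes"])
assert n_cells == len(rawmat["barcodes"])
print(f"{n_genes} genes x {n_cells} cells, {rawmat['matrix'].nnz} non-zero entries")
```

A dict like this would then be passed as rawmat in place of the DataFrame in the copykat() call.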
import anndata as ad
from copykat_py import copykat_anndata
adata = ad.read_h5ad("sample.h5ad")
result = copykat_anndata(
adata=adata,
selecting_meta=["CellType", "copykat_pred", "seurat_clusters"],
row_split="CellType",
sample_name="sample1",
genome="hg20",
distance="euclidean",
n_cores=24,
output_dir="results/sample1_anndata",
)
print(result["prediction"]["copykat.pred"].value_counts())
Useful options: layer (use adata.layers[...]), use_raw (use adata.raw), selecting_meta (export obs columns for annotated heatmaps), row_split (column defining row groups).
All entry points produce the same outputs as CopyKAT-R:
- *_copykat_CNA_results.txt
- *_copykat_prediction.txt
- *_copykat_heatmap.png
- copykat_run.log
When metadata is supplied, an additional annotated heatmap PNG is produced. AnnData workflows also write *_selected_obs_meta.csv.
Produce a CNA heatmap with per-cell metadata annotations (cell type, cluster labels, etc.). Rows are split into labelled groups by a chosen metadata column and ordered by hierarchical or K-means clustering within each group.
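The row layout described here can be sketched as follows. This is an illustration of the general idea, not the package's plotting code: the function name `order_rows_within_groups` is hypothetical, Ward linkage is an assumed choice, and the data are random.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage


def order_rows_within_groups(mat, groups):
    """Return row indices grouped by label and hierarchically ordered within each group."""
    groups = pd.Series(groups).reset_index(drop=True)
    order = []
    for label in sorted(groups.unique()):
        idx = np.flatnonzero(groups.values == label)
        if len(idx) > 2:
            # Cluster only the rows of this group; leaves_list gives the dendrogram order.
            idx = idx[leaves_list(linkage(mat[idx], method="ward"))]
        order.extend(idx.tolist())
    return np.array(order)


rng = np.random.default_rng(0)
mat = rng.normal(size=(12, 20))                  # 12 cells x 20 genomic bins (toy data)
cell_types = ["tumor", "immune", "stromal"] * 4  # metadata column used for the split
order = order_rows_within_groups(mat, cell_types)
print(pd.Series(cell_types).iloc[order].tolist())
```

Clustering each group separately keeps the cost proportional to the largest group rather than to the whole dataset.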
Re-plot from an existing CNA results file without re-running the full analysis:
copykat-py-plot \
--cna sample_copykat_CNA_results.txt \
--meta xenium_ft_full_meta_celltype_leiden.csv \
--row-split inferred_CellType \
--sample-name xenium_all_cells \
--n-cores 40 \
--output xenium_annotated_heatmap.png
| Flag | Default | Description |
|---|---|---|
| --cna / -c | (required) | *_copykat_CNA_results.txt from a copykat-py run |
| --meta / -m | (required) | Annotation CSV: first column = cell name, rest = metadata |
| --row-split | second column | Column used to split and label row groups |
Meta CSV format — header row is auto-detected:
cell_name,leiden_cluster,inferred_CellType
aaaajgij-1,5,lumhr
aaaandia-1,5,lumhr
...
Cells present in the CNA results but absent from the CSV are labelled "unknown" and shown in grey. All remaining metadata columns are drawn as coloured annotation sidebars.
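The "unknown" labelling can be pictured with plain pandas. This is a sketch of the alignment idea only, not the package's internal code. The third cell name is hypothetical and is included to show a cell that is missing from the CSV.

```python
import io

import pandas as pd

# Metadata in the format shown above: first column = cell name, rest = annotations.
meta_csv = io.StringIO(
    "cell_name,leiden_cluster,inferred_CellType\n"
    "aaaajgij-1,5,lumhr\n"
    "aaaandia-1,5,lumhr\n"
)
meta = pd.read_csv(meta_csv, index_col=0, dtype=str)

# Cell names as they would appear in the CNA results; the last one has no metadata row.
cna_cells = ["aaaajgij-1", "aaaandia-1", "zzzzzzzz-1"]

# Align metadata to the CNA cell order and label missing cells as "unknown".
aligned = meta.reindex(cna_cells).fillna("unknown")
print(aligned)
```

Reading the annotations as strings keeps cluster IDs such as 5 categorical rather than numeric, which suits a coloured sidebar.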
import pandas as pd
from copykat_py.plotting import plot_heatmap_annotated
cna = pd.read_csv("sample_copykat_CNA_results.txt", sep="\t")
plot_heatmap_annotated(
mat = cna.iloc[:, 3:].values.astype("float32"),
cell_names = cna.columns[3:].tolist(),
chrom_info = cna.iloc[:, 0].values,
meta_csv = "xenium_ft_full_meta_celltype_leiden.csv",
row_split_col = "inferred_CellType",
sample_name = "xenium_all_cells",
n_cores = 40,
output_path = "xenium_annotated_heatmap.png",
)
Both CopyKAT-R and CopyKAT-Python were tested on raw datasets (no QC filtering) using 24 cores. A total of 16 datasets spanning human and mouse tissues across multiple 10X platforms were used for validation.
10x Genomics Benchmark Datasets
| Sample | Species | Tissue | Assay | Reported Cells |
|---|---|---|---|---|
| human_pbmc_10k_3pv3 | human | PBMC healthy control | Universal 3' v3 | 11,769 |
| human_nsclc_5pv1 | human | NSCLC tumor | Universal 5' v1 | 7,802 |
| human_ovarian_flex | human | Ovarian cancer FFPE | Flex | 17,553 |
| human_kidney_gemx_flex | human | Kidney nuclei control | GEM-X Flex | 4,633 |
| human_hodgkins_3pv31 | human | Hodgkin's lymphoma | Universal 3' v3.1 | 3,394 |
| human_breast_idc_7p5k_3pv31 | human | Invasive ductal carcinoma | Universal 3' v3.1 | 5,680 |
| human_breast_idc_750_lt_3pv31 | human | Invasive ductal carcinoma | Universal 3' LT v3.1 | 687 |
| human_melanoma_5p_nextgem | human | Melanoma dissociated tumor cells | Universal 5' NextGEM | 6,704 |
| mouse_brain_neurons_2k_v21 | mouse | E18 brain neurons | Universal 3' v2.1 | 2,022 |
| mouse_brain_neurons_10k_v3 | mouse | Brain neurons | Universal 3' v3 | 11,843 |
| mouse_heart_1k_v3 | mouse | Heart E18 | Universal 3' v3 | 1,011 |
| mouse_heart_10k_v3 | mouse | Heart E18 | Universal 3' v3 | 7,713 |
| mouse_brain_e18_10k_si_3pv31 | mouse | E18 cortex hippocampus SVZ | Universal 3' v3.1 SI | 11,316 |
| mouse_kidney_nuclei_1k_3pv31 | mouse | Adult kidney nuclei | Universal 3' v3.1 | 1,385 |
| mouse_liver_nuclei_5k_3pv31 | mouse | Adult liver nuclei | Universal 3' v3.1 | 6,311 |
| mouse_brain_gemx | mouse | E18 brain neurons | Universal 3' GEM-X | 12,441 |
Key Metrics Comparison
The full FFPE Human Breast Cancer Xenium (Atera) dataset was subset to 50k cells, 100k cells, and the full dataset (~170k cells) to evaluate scalability.
CNV Heatmap with Annotation
In the comparison of final predictions, Seurat cluster 4 was called diploid by CopyKAT-R but aneuploid by CopyKAT-Python. In this case, the corresponding H&E cell morphology confirmed the CopyKAT-Python call.
The key difference is in the final prediction step (step 8), where both implementations perform hierarchical clustering on the adjusted CNA matrix and cut the tree at k=2. R's copykat explicitly uses method = "ward.D" in hclust(), while CopyKAT-Python uses scipy/fastcluster's "ward", which implements the mathematically correct ward.D2 criterion. For cell clusters with subtle CNV profiles that sit near the boundary of the diploid/aneuploid split (like Seurat cluster 4 here), the two linkage variants can produce different dendrogram topologies, causing the binary label assignment to flip.
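The effect of the two linkage variants can be explored on toy data with SciPy alone. This sketch is not the code of either package. It relies on the known relationship that ward.D2 on distances d gives the same merges as ward.D on d squared, so feeding sqrt(d) to SciPy's "ward" emulates the merge order of ward.D on d. The toy matrix and its subtle shift are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Toy "CNA matrix": 60 cells x 50 bins, with a subtle gain in the last 20 cells.
cna = rng.normal(0.0, 0.1, size=(60, 50))
cna[40:, :10] += 0.08

d = pdist(cna, metric="euclidean")

# SciPy's "ward" on Euclidean distances corresponds to R's ward.D2.
labels_d2 = fcluster(linkage(d, method="ward"), t=2, criterion="maxclust")

# Passing sqrt(d) runs the same update on unsquared distances,
# which emulates the merge order of ward.D applied to d.
labels_d = fcluster(linkage(np.sqrt(d), method="ward"), t=2, criterion="maxclust")

# Cluster IDs are arbitrary, so compare the two partitions up to label swapping.
same = np.mean(labels_d == labels_d2)
agreement = max(same, 1.0 - same)
print(f"k=2 label agreement between variants: {agreement:.2f}")
```

An agreement below 1.00 means some cells change sides of the k=2 cut depending on the linkage variant, which is the kind of flip described above.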
CopyKAT-Python results may not be identical to CopyKAT-R due to differences in:
- Gene annotation versions
- Filtering and preprocessing steps
- Numerical implementation details
- Smoothing and segmentation algorithms
- Clustering behavior (parDist + hcluster vs. PCA + fastcluster)
High-confidence results typically show:
- Clear chromosome-arm or whole-chromosome CNV patterns
- Consistent CNV profiles within clusters
- Strong separation between inferred diploid and aneuploid cells
Lower-confidence results may occur in samples with:
- Weak CNV signal or low sequencing depth
- Few normal reference cells
- Strong batch effects
- Near-diploid tumor genomes
Disclaimer: CopyKAT-Python is an independent reimplementation focused on scalability and usability, while faithfully reproducing the core CopyKAT analytical strategy.




