------------ Under Internal Testing ------------

CopyKAT-Python

CopyKAT-Python is a Python reimplementation of the CopyKAT workflow for inferring large-scale copy number alterations (CNAs) from single-cell RNA-seq data. It reproduces the core CopyKAT strategy while improving scalability, usability, and integration with modern AnnData/Scanpy pipelines.

Why CopyKAT-Python?

The original CopyKAT-R package is widely used for distinguishing aneuploid tumor cells from diploid normal cells using scRNA-seq data. Recurring practical limitations include:

Long runtimes, with reports of >1 hour for ~8,000 cells
Inability to handle very large datasets (hundreds of thousands to millions of cells) due to hierarchical clustering limits
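
As a rough illustration of why dense hierarchical clustering becomes impractical at this scale, the sketch below estimates the memory needed just to hold a condensed pairwise distance vector of 8-byte floats. It is generic arithmetic, not a description of CopyKAT-R internals; the function name `condensed_distance_gib` is hypothetical.

```python
def condensed_distance_gib(n_cells, bytes_per_value=8):
    """Memory (GiB) for the n*(n-1)/2 pairwise distances among n_cells cells."""
    n_pairs = n_cells * (n_cells - 1) // 2
    return n_pairs * bytes_per_value / 2**30


# Cell counts taken from the sizes mentioned in this README.
for n in (8_000, 50_000, 170_000):
    print(f"{n:>7,} cells: {condensed_distance_gib(n):8.1f} GiB")
```

At about 170k cells the distance vector alone is on the order of 100 GiB, which is why approaches that avoid a full all-pairs matrix matter for large datasets.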

Highlights:

Same core parameters as CopyKAT-R, with added Python conveniences
Handles datasets from thousands to hundreds of thousands of cells with substantially shorter runtimes
Annotated CNA heatmaps with per-cell metadata sidebars (cell type, cluster labels, etc.)
Pre-built Singularity container for reproducible deployment
Validated across 16 human and mouse 10X datasets and a 170k-cell Xenium whole-transcript dataset

Installation

From source:

Installs copykat-py into your current environment

git clone https://github.com/navinlabcode/Copykat_python.git
cd Copykat_python
pip install -e .

From environment.yml with conda:

Creates a fresh conda environment for copykat-py with all required packages

git clone https://github.com/navinlabcode/Copykat_python.git
cd Copykat_python
conda env create -f environment.yml
conda activate copykit_py

## After activation, confirm the commands are available
copykat_matrix --help
copykat_anndata --help

Singularity container (recommended for HPC environments):

wget https://github.com/navinlabcode/Copykat_python/releases/download/v1.0.0/copykat_py.sif
singularity exec copykat_py.sif copykat-py --help

How to Run

CopyKAT-Python supports two main entry points:

Entry point	When to use
`copykat_matrix` / `copykat-py`	Input is a `.csv`, `.tsv`, or `.mtx` matrix file on disk; run from a Linux terminal
`copykat_anndata()`	Input is an already-loaded `AnnData` object in Python

Terminal — `copykat_matrix` / `copykat-py`

CSV or TSV matrix:

copykat_matrix \
    --input sample_counts.csv \
    --sample-name sample1 \
    --genome hg20 \
    --n-cores 24 \
    --output-dir results/sample1

10X matrix market input:

copykat_matrix \
    --input filtered_feature_bc_matrix/matrix.mtx.gz \
    --genes filtered_feature_bc_matrix/features.tsv.gz \
    --barcodes filtered_feature_bc_matrix/barcodes.tsv.gz \
    --sample-name sample1 \
    --genome hg20 \
    --n-cores 24 \
    --output-dir results/sample1

Pass --meta (and optionally --row-split) to produce an annotated heatmap alongside the standard output. See Annotated Heatmap with Metadata.

Python API — `copykat()`

import pandas as pd
from copykat_py import copykat

counts = pd.read_csv("sample_counts.csv", index_col=0)

result = copykat(
    rawmat=counts,
    id_type="S",
    sam_name="sample1",
    genome="hg20",
    distance="euclidean",
    n_cores=24,
)

print(result["prediction"].head())

rawmat can also be a dict with keys matrix, genes, and barcodes for sparse matrices.
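
A minimal sketch of assembling such a dict from 10X-style files is shown here. It assumes the matrix is stored genes x cells, that `features.tsv` carries gene symbols in its second column, and that a SciPy sparse matrix plus plain lists are acceptable values; the exact types `copykat()` expects are not stated above, so treat these as assumptions. The helper `load_10x_as_rawmat` and the tiny synthetic files are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from scipy import io, sparse


def load_10x_as_rawmat(matrix_path, genes_path, barcodes_path):
    """Read 10X-style files into a dict with keys matrix, genes, barcodes."""
    matrix = sparse.csr_matrix(io.mmread(str(matrix_path)))  # assumed genes x cells
    features = pd.read_csv(genes_path, sep="\t", header=None)
    # Use the second column (gene symbol) when present, else the first.
    gene_col = 1 if features.shape[1] > 1 else 0
    genes = features.iloc[:, gene_col].astype(str).tolist()
    barcodes = pd.read_csv(barcodes_path, sep="\t", header=None)[0].astype(str).tolist()
    if matrix.shape != (len(genes), len(barcodes)):
        raise ValueError("matrix shape does not match genes x barcodes")
    return {"matrix": matrix, "genes": genes, "barcodes": barcodes}


# Tiny synthetic example so the sketch runs without real data.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    io.mmwrite(str(tmp / "matrix.mtx"), sparse.coo_matrix(np.array([[0, 2, 1], [3, 0, 0]])))
    (tmp / "features.tsv").write_text(
        "ENSG1\tGENE1\tGene Expression\nENSG2\tGENE2\tGene Expression\n"
    )
    (tmp / "barcodes.tsv").write_text("AAAC-1\nAAAG-1\nAAAT-1\n")
    rawmat = load_10x_as_rawmat(
        tmp / "matrix.mtx", tmp / "features.tsv", tmp / "barcodes.tsv"
    )

print(rawmat["matrix"].shape, rawmat["genes"], rawmat["barcodes"])
```

The resulting `rawmat` could then be passed as `copykat(rawmat=rawmat, ...)` with the same remaining arguments as in the example above.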

Python API — `copykat_anndata()`

import anndata as ad
from copykat_py import copykat_anndata

adata = ad.read_h5ad("sample.h5ad")

result = copykat_anndata(
    adata=adata,
    selecting_meta=["CellType", "copykat_pred", "seurat_clusters"],
    row_split="CellType",
    sample_name="sample1",
    genome="hg20",
    distance="euclidean",
    n_cores=24,
    output_dir="results/sample1_anndata",
)

print(result["prediction"]["copykat.pred"].value_counts())

Useful options: layer (use adata.layers[...]), use_raw (use adata.raw), selecting_meta (export obs columns for annotated heatmaps), row_split (column defining row groups).

Output files

All entry points produce the same output files as CopyKAT-R:

*_copykat_CNA_results.txt
*_copykat_prediction.txt
*_copykat_heatmap.png
copykat_run.log

When metadata is supplied, an additional annotated heatmap PNG is produced. AnnData workflows also write *_selected_obs_meta.csv.
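
For a quick look at a finished run, the prediction file can be tabulated with pandas. This is a sketch that assumes the file is tab-delimited and contains a `copykat.pred` column, as the in-memory result does in the `copykat_anndata()` example; the `cell.names` column and the helper `summarize_predictions` are hypothetical, and the inline data stands in for a real `*_copykat_prediction.txt`.

```python
import io

import pandas as pd


def summarize_predictions(path_or_buffer, pred_col="copykat.pred"):
    """Count cells per predicted class and report each class's fraction."""
    pred = pd.read_csv(path_or_buffer, sep="\t")
    counts = pred[pred_col].value_counts()
    return pd.DataFrame({"n_cells": counts, "fraction": counts / counts.sum()})


# Synthetic stand-in for results/sample1/sample1_copykat_prediction.txt
demo = io.StringIO(
    "cell.names\tcopykat.pred\n"
    "c1\taneuploid\n"
    "c2\tdiploid\n"
    "c3\taneuploid\n"
    "c4\taneuploid\n"
)
summary = summarize_predictions(demo)
print(summary)
```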

Annotated Heatmap with Metadata

Produce a CNA heatmap with per-cell metadata annotations (cell type, cluster labels, etc.). Rows are split into labelled groups by a chosen metadata column and ordered by hierarchical or K-means clustering within each group.
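
The row layout described above can be sketched as follows. This is an illustration of the idea only (group rows by a metadata label, then order rows inside each group by the leaf order of a hierarchical clustering), not the package's plotting code; it covers only the hierarchical option, and the function `order_rows_within_groups` and the random data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage


def order_rows_within_groups(mat, groups):
    """Return a row order: groups kept contiguous, rows clustered inside each."""
    mat = np.asarray(mat, dtype=float)
    groups = np.asarray(groups)
    order = []
    for label in np.unique(groups):  # groups appear in sorted label order
        idx = np.flatnonzero(groups == label)
        if idx.size > 2:
            # Ward linkage on Euclidean distances; leaf order gives the row order.
            idx = idx[leaves_list(linkage(mat[idx], method="ward"))]
        order.extend(idx.tolist())
    return np.array(order)


# Toy CNA-like matrix: 8 cells x 12 genomic bins with a group label per cell.
rng = np.random.default_rng(0)
mat = rng.normal(size=(8, 12))
groups = np.array(
    ["tumor", "immune", "tumor", "immune", "tumor", "stroma", "immune", "tumor"]
)
order = order_rows_within_groups(mat, groups)
print(groups[order])
```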

CLI — standalone re-plot (`copykat-py-plot`)

Re-plot from an existing CNA results file without re-running the full analysis:

copykat-py-plot \
    --cna  sample_copykat_CNA_results.txt \
    --meta xenium_ft_full_meta_celltype_leiden.csv \
    --row-split inferred_CellType \
    --sample-name xenium_all_cells \
    --n-cores 40 \
    --output xenium_annotated_heatmap.png

Flag	Default	Description
`--cna` / `-c`	(required)	`*_copykat_CNA_results.txt` from a copykat-py run
`--meta` / `-m`	(required)	Annotation CSV — first column = cell name, rest = metadata
`--row-split`	second column	Column used to split and label row groups

Meta CSV format — header row is auto-detected:

cell_name,leiden_cluster,inferred_CellType
aaaajgij-1,5,lumhr
aaaandia-1,5,lumhr
...

Cells present in the CNA results but absent from the CSV are labelled "unknown" and shown in grey. All remaining metadata columns are drawn as coloured annotation sidebars.
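
The "unknown" labelling rule can be illustrated with a small pandas sketch. It assumes a CSV with a header row and the cell name in the first column; `align_metadata` and the cell `zzzz-1` are hypothetical, and this is not the package's own implementation.

```python
import io

import pandas as pd


def align_metadata(cell_names, meta_csv):
    """Reorder metadata to match cell_names; missing cells become 'unknown'."""
    meta = pd.read_csv(meta_csv, index_col=0, dtype=str)
    # reindex adds all-NaN rows for cells missing from the CSV.
    return meta.reindex(cell_names).fillna("unknown")


meta_csv = io.StringIO(
    "cell_name,leiden_cluster,inferred_CellType\n"
    "aaaajgij-1,5,lumhr\n"
    "aaaandia-1,5,lumhr\n"
)
# 'zzzz-1' stands for a cell in the CNA results with no row in the CSV.
aligned = align_metadata(["aaaandia-1", "zzzz-1", "aaaajgij-1"], meta_csv)
print(aligned)
```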

Python API — `plot_heatmap_annotated`

import pandas as pd
from copykat_py.plotting import plot_heatmap_annotated

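# The first three columns are treated as genomic coordinates (column 0 = chromosome);
# every remaining column is one cell.
cna = pd.read_csv("sample_copykat_CNA_results.txt", sep="\t")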
plot_heatmap_annotated(
    mat           = cna.iloc[:, 3:].values.astype("float32"),
    cell_names    = cna.columns[3:].tolist(),
    chrom_info    = cna.iloc[:, 0].values,
    meta_csv      = "xenium_ft_full_meta_celltype_leiden.csv",
    row_split_col = "inferred_CellType",
    sample_name   = "xenium_all_cells",
    n_cores       = 40,
    output_path   = "xenium_annotated_heatmap.png",
)

Benchmarking and Validation

10X Dataset Validation

Both CopyKAT-R and CopyKAT-Python were tested on raw datasets (no QC filtering) using 24 cores. A total of 16 datasets spanning human and mouse tissues across multiple 10X platforms were used for validation.

10x Genomics Benchmark Datasets

Sample	Species	Tissue	Assay	Reported Cells
human_pbmc_10k_3pv3	human	PBMC healthy control	Universal 3' v3	11,769
human_nsclc_5pv1	human	NSCLC tumor	Universal 5' v1	7,802
human_ovarian_flex	human	Ovarian cancer FFPE	Flex	17,553
human_kidney_gemx_flex	human	Kidney nuclei control	GEM-X Flex	4,633
human_hodgkins_3pv31	human	Hodgkin's lymphoma	Universal 3' v3.1	3,394
human_breast_idc_7p5k_3pv31	human	Invasive ductal carcinoma	Universal 3' v3.1	5,680
human_breast_idc_750_lt_3pv31	human	Invasive ductal carcinoma	Universal 3' LT v3.1	687
human_melanoma_5p_nextgem	human	Melanoma dissociated tumor cells	Universal 5' NextGEM	6,704
mouse_brain_neurons_2k_v21	mouse	E18 brain neurons	Universal 3' v2.1	2,022
mouse_brain_neurons_10k_v3	mouse	Brain neurons	Universal 3' v3	11,843
mouse_heart_1k_v3	mouse	Heart E18	Universal 3' v3	1,011
mouse_heart_10k_v3	mouse	Heart E18	Universal 3' v3	7,713
mouse_brain_e18_10k_si_3pv31	mouse	E18 cortex hippocampus SVZ	Universal 3' v3.1 SI	11,316
mouse_kidney_nuclei_1k_3pv31	mouse	Adult kidney nuclei	Universal 3' v3.1	1,385
mouse_liver_nuclei_5k_3pv31	mouse	Adult liver nuclei	Universal 3' v3.1	6,311
mouse_brain_gemx	mouse	E18 brain neurons	Universal 3' GEM-X	12,441

Side-by-Side Comparison: CopyKAT-R vs CopyKAT-Python

Figures: side-by-side comparisons for human_pbmc_10k_3pv3, human_breast_idc_7p5k_3pv31, and mouse_brain_e18_10k_si_3pv31.

Key Metrics Comparison

Large-Scale Testing: Xenium Atera Dataset

To evaluate scalability, the full FFPE Human Breast Cancer Xenium (Atera) dataset was run at three sizes: subsets of 50k and 100k cells, and the full dataset (~170k cells).

Figures: Runtime Comparison; CNV Heatmap with Annotation; CNV comparison.

Why Results May Differ from CopyKAT-R

In the final-prediction comparison above, Seurat cluster 4 was called diploid by CopyKAT-R but aneuploid by CopyKAT-Python. In this case, the CopyKAT-Python call was confirmed as correct by the corresponding H&E cell morphology.

The key difference lies in the final prediction step (step 8), where both implementations perform hierarchical clustering on the adjusted CNA matrix and cut the tree at k=2. R's copykat explicitly uses method = "ward.D" in hclust(), while CopyKAT-Python uses scipy/fastcluster's "ward", which implements Ward's criterion and is equivalent to R's ward.D2. For clusters of cells with subtle CNV profiles that sit near the boundary of the diploid/aneuploid split (such as Seurat cluster 4 here), the two linkage variants can produce different dendrogram topologies, causing the binary label assignment to flip.
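
The effect of the two Ward variants can be explored with a small SciPy sketch. It relies on the relationship that ward.D2 on distances d matches ward.D on d squared (up to a monotone rescaling of merge heights), so the ward.D tree topology is emulated here by running SciPy's "ward" on the square root of the distances. This is an illustration on synthetic data, without the PCA step, and not the code used by either package; `two_cluster_labels` and the simulated groups are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def two_cluster_labels(mat, variant="ward.D2"):
    """Cut a Ward tree at k=2 using the ward.D2 or an emulated ward.D update."""
    d = pdist(np.asarray(mat, dtype=float), metric="euclidean")
    if variant == "ward.D2":
        tree = linkage(d, method="ward")  # Ward's criterion (squares d internally)
    elif variant == "ward.D":
        tree = linkage(np.sqrt(d), method="ward")  # same merge order as ward.D on d
    else:
        raise ValueError("variant must be 'ward.D' or 'ward.D2'")
    return fcluster(tree, t=2, criterion="maxclust")


# Synthetic cells x bins matrix: flat, strongly altered, and subtly altered groups.
rng = np.random.default_rng(0)
flat = rng.normal(0.0, 0.05, size=(30, 50))
strong = rng.normal(0.0, 0.05, size=(30, 50))
strong[:, :20] += 0.5
subtle = rng.normal(0.0, 0.05, size=(30, 50))
subtle[:, :20] += 0.1
mat = np.vstack([flat, strong, subtle])

labels_d2 = two_cluster_labels(mat, "ward.D2")
labels_d = two_cluster_labels(mat, "ward.D")

# Cluster IDs are arbitrary, so measure agreement up to a label swap.
same = float(np.mean(labels_d == labels_d2))
agreement = max(same, 1.0 - same)
print(f"agreement between variants: {agreement:.2f}")
```

Whether the two variants disagree depends on the data; groups with weak signal, like the subtle group here, are the ones most likely to switch sides of the k=2 cut.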

CopyKAT-Python results may not be identical to CopyKAT-R due to differences in:

Gene annotation versions
Filtering and preprocessing steps
Numerical implementation details
Smoothing and segmentation algorithms
Clustering behavior (parDist + hcluster vs. PCA + fastcluster)

High-confidence results typically show:

Clear chromosome-arm or whole-chromosome CNV patterns
Consistent CNV profiles within clusters
Strong separation between inferred diploid and aneuploid cells

Lower-confidence results may occur in samples with:

Weak CNV signal or low sequencing depth
Few normal reference cells
Strong batch effects
Near-diploid tumor genomes

Disclaimer: CopyKAT-Python is an independent reimplementation focused on scalability and usability, while faithfully reproducing the core CopyKAT analytical strategy.
