------------ Under Internal Testing ------------

CopyKAT-Python

CopyKAT-Python is a Python reimplementation of the CopyKAT workflow for inferring large-scale copy number alterations (CNAs) from single-cell RNA-seq data. It reproduces the core CopyKAT strategy while improving scalability, usability, and integration with modern AnnData/Scanpy pipelines.

Why CopyKAT-Python?

The original CopyKAT-R package is widely used for distinguishing aneuploid tumor cells from diploid normal cells using scRNA-seq data. Recurring practical limitations include:

  • Long runtimes, with reports of >1 hour for ~8,000 cells
  • Inability to handle very large datasets (hundreds of thousands to millions of cells) due to hierarchical clustering limits
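
To put the clustering limit in numbers: the text above does not say which limit is hit first, but one likely bottleneck (an assumption here) is the pairwise distance matrix that hierarchical clustering needs, which grows quadratically with the number of cells. A rough estimate, assuming a condensed matrix of 8-byte floats:

```python
def condensed_distance_bytes(n_cells: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to store all n*(n-1)/2 pairwise distances between cells."""
    return n_cells * (n_cells - 1) // 2 * bytes_per_value


# Assumed storage: one 8-byte float per cell pair (condensed form, no diagonal).
for n in (8_000, 170_000, 1_000_000):
    gb = condensed_distance_bytes(n) / 1e9
    print(f"{n:>9,} cells -> {gb:,.1f} GB")
```

Under these assumptions the distance matrix alone takes roughly 0.26 GB at 8,000 cells, about 116 GB at 170,000 cells, and about 4 TB at one million cells.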

Highlights:

  • Same core parameters as CopyKAT-R, plus Python-side conveniences
  • Handles datasets from thousands to hundreds of thousands of cells, with substantially shorter runtimes than CopyKAT-R
  • Annotated CNA heatmaps with per-cell metadata sidebars (cell type, cluster labels, etc.)
  • Pre-built Singularity container for reproducible deployment
  • Validated across 16 human and mouse 10X datasets and a 170k-cell Xenium whole-transcript dataset

Installation

From source:

Installs copykat-py into your current environment:

git clone https://github.com/navinlabcode/Copykat_python.git
cd Copykat_python
pip install -e .

From environment.yml with conda:

Creates a fresh conda environment for copykat-py with all required packages:

git clone https://github.com/navinlabcode/Copykat_python.git
cd Copykat_python
conda env create -f environment.yml
conda activate copykit_py

## After activation, confirm the commands are available
copykat_matrix --help
copykat_anndata --help

Singularity container (recommended for HPC environments):

wget https://github.com/navinlabcode/Copykat_python/releases/download/v1.0.0/copykat_py.sif
singularity exec copykat_py.sif copykat-py --help

How to Run

CopyKAT-Python supports two main entry points:

| Entry point | When to use |
| --- | --- |
| copykat_matrix / copykat-py | Input is a .csv, .tsv, or .mtx matrix file on disk (run from a Linux shell) |
| copykat_anndata() | Input is an already-loaded AnnData object in Python |

Terminal — copykat_matrix / copykat-py

CSV or TSV matrix:

copykat_matrix \
    --input sample_counts.csv \
    --sample-name sample1 \
    --genome hg20 \
    --n-cores 24 \
    --output-dir results/sample1

10X matrix market input:

copykat_matrix \
    --input filtered_feature_bc_matrix/matrix.mtx.gz \
    --genes filtered_feature_bc_matrix/features.tsv.gz \
    --barcodes filtered_feature_bc_matrix/barcodes.tsv.gz \
    --sample-name sample1 \
    --genome hg20 \
    --n-cores 24 \
    --output-dir results/sample1

Pass --meta (and optionally --row-split) to produce an annotated heatmap alongside the standard output. See Annotated Heatmap with Metadata.

Python API — copykat()

import pandas as pd
from copykat_py import copykat

counts = pd.read_csv("sample_counts.csv", index_col=0)

result = copykat(
    rawmat=counts,
    id_type="S",
    sam_name="sample1",
    genome="hg20",
    distance="euclidean",
    n_cores=24,
)

print(result["prediction"].head())

rawmat can also be a dict with keys matrix, genes, and barcodes for sparse matrices.

Python API — copykat_anndata()

import anndata as ad
from copykat_py import copykat_anndata

adata = ad.read_h5ad("sample.h5ad")

result = copykat_anndata(
    adata=adata,
    selecting_meta=["CellType", "copykat_pred", "seurat_clusters"],
    row_split="CellType",
    sample_name="sample1",
    genome="hg20",
    distance="euclidean",
    n_cores=24,
    output_dir="results/sample1_anndata",
)

print(result["prediction"]["copykat.pred"].value_counts())

Useful options: layer (use adata.layers[...]), use_raw (use adata.raw), selecting_meta (export obs columns for annotated heatmaps), row_split (column defining row groups).

Output files

All entry points produce the same outputs as CopyKAT-R:

  • *_copykat_CNA_results.txt
  • *_copykat_prediction.txt
  • *_copykat_heatmap.png
  • copykat_run.log

When metadata is supplied, an additional annotated heatmap PNG is produced. AnnData workflows also write *_selected_obs_meta.csv.
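
As a quick check on a finished run, the prediction file can be tallied with the standard library alone. This is a minimal sketch, assuming the file is tab-delimited with a copykat.pred column (that column name appears in the Python API example above; the cell.names column, the label values, and the helper name count_predictions are illustrative assumptions, not part of the package):

```python
import csv
import io
from collections import Counter


def count_predictions(handle, column="copykat.pred"):
    """Count how many cells carry each prediction label in a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[column] for row in reader)


# Toy stand-in for a *_copykat_prediction.txt file (layout is assumed).
toy = io.StringIO(
    "cell.names\tcopykat.pred\n"
    "cell_1\tdiploid\n"
    "cell_2\taneuploid\n"
    "cell_3\taneuploid\n"
)
counts = count_predictions(toy)
print(counts)
```

For a real run, pass an open file handle instead of the toy text, for example `open("sample1_copykat_prediction.txt", newline="")`, where the filename follows the `*_copykat_prediction.txt` pattern listed above.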


Annotated Heatmap with Metadata

Produce a CNA heatmap with per-cell metadata annotations (cell type, cluster labels, etc.). Rows are split into labelled groups by a chosen metadata column and ordered by hierarchical or K-means clustering within each group.

CLI — standalone re-plot (copykat-py-plot)

Re-plot from an existing CNA results file without re-running the full analysis:

copykat-py-plot \
    --cna  sample_copykat_CNA_results.txt \
    --meta xenium_ft_full_meta_celltype_leiden.csv \
    --row-split inferred_CellType \
    --sample-name xenium_all_cells \
    --n-cores 40 \
    --output xenium_annotated_heatmap.png

| Flag | Default | Description |
| --- | --- | --- |
| --cna / -c | (required) | *_copykat_CNA_results.txt from a copykat-py run |
| --meta / -m | (required) | Annotation CSV; first column = cell name, rest = metadata |
| --row-split | second column | Column used to split and label row groups |

Meta CSV format — header row is auto-detected:

cell_name,leiden_cluster,inferred_CellType
aaaajgij-1,5,lumhr
aaaandia-1,5,lumhr
...

Cells present in the CNA results but absent from the CSV are labelled "unknown" and shown in grey. All remaining metadata columns are drawn as coloured annotation sidebars.
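
The lookup described above can be sketched as follows. This is an illustration of the behaviour, not the package's own code: the helper name labels_for_cells is hypothetical, and the sketch assumes a header row is present rather than auto-detecting it.

```python
import csv
import io


def labels_for_cells(meta_handle, cell_names, column=None, missing="unknown"):
    """Return one label per cell, using `missing` for cells absent from the CSV."""
    reader = csv.reader(meta_handle)
    header = next(reader)  # assumes the first row is a header
    # Default to the second column, mirroring the documented --row-split default.
    col_idx = 1 if column is None else header.index(column)
    lookup = {row[0]: row[col_idx] for row in reader if row}
    return [lookup.get(name, missing) for name in cell_names]


meta = io.StringIO(
    "cell_name,leiden_cluster,inferred_CellType\n"
    "aaaajgij-1,5,lumhr\n"
    "aaaandia-1,5,lumhr\n"
)
# "zzzz-1" is a made-up cell name that is not in the metadata.
labels = labels_for_cells(meta, ["aaaajgij-1", "zzzz-1"], column="inferred_CellType")
print(labels)
```

Keying the lookup on the first column keeps the match independent of row order, so the metadata CSV does not need to be sorted like the CNA results.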

Python API — plot_heatmap_annotated

import pandas as pd
from copykat_py.plotting import plot_heatmap_annotated

cna = pd.read_csv("sample_copykat_CNA_results.txt", sep="\t")
plot_heatmap_annotated(
    mat           = cna.iloc[:, 3:].values.astype("float32"),
    cell_names    = cna.columns[3:].tolist(),
    chrom_info    = cna.iloc[:, 0].values,
    meta_csv      = "xenium_ft_full_meta_celltype_leiden.csv",
    row_split_col = "inferred_CellType",
    sample_name   = "xenium_all_cells",
    n_cores       = 40,
    output_path   = "xenium_annotated_heatmap.png",
)

Benchmarking and Validation

10X Dataset Validation

Both CopyKAT-R and CopyKAT-Python were tested on raw datasets (no QC filtering) using 24 cores. A total of 16 datasets spanning human and mouse tissues across multiple 10X platforms were used for validation.

10x Genomics Benchmark Datasets
| Sample | Species | Tissue | Assay | Reported Cells |
| --- | --- | --- | --- | --- |
| human_pbmc_10k_3pv3 | human | PBMC healthy control | Universal 3' v3 | 11,769 |
| human_nsclc_5pv1 | human | NSCLC tumor | Universal 5' v1 | 7,802 |
| human_ovarian_flex | human | Ovarian cancer FFPE | Flex | 17,553 |
| human_kidney_gemx_flex | human | Kidney nuclei control | GEM-X Flex | 4,633 |
| human_hodgkins_3pv31 | human | Hodgkin's lymphoma | Universal 3' v3.1 | 3,394 |
| human_breast_idc_7p5k_3pv31 | human | Invasive ductal carcinoma | Universal 3' v3.1 | 5,680 |
| human_breast_idc_750_lt_3pv31 | human | Invasive ductal carcinoma | Universal 3' LT v3.1 | 687 |
| human_melanoma_5p_nextgem | human | Melanoma dissociated tumor cells | Universal 5' NextGEM | 6,704 |
| mouse_brain_neurons_2k_v21 | mouse | E18 brain neurons | Universal 3' v2.1 | 2,022 |
| mouse_brain_neurons_10k_v3 | mouse | Brain neurons | Universal 3' v3 | 11,843 |
| mouse_heart_1k_v3 | mouse | Heart E18 | Universal 3' v3 | 1,011 |
| mouse_heart_10k_v3 | mouse | Heart E18 | Universal 3' v3 | 7,713 |
| mouse_brain_e18_10k_si_3pv31 | mouse | E18 cortex hippocampus SVZ | Universal 3' v3.1 SI | 11,316 |
| mouse_kidney_nuclei_1k_3pv31 | mouse | Adult kidney nuclei | Universal 3' v3.1 | 1,385 |
| mouse_liver_nuclei_5k_3pv31 | mouse | Adult liver nuclei | Universal 3' v3.1 | 6,311 |
| mouse_brain_gemx | mouse | E18 brain neurons | Universal 3' GEM-X | 12,441 |

Side-by-Side Comparison: CopyKAT-R vs CopyKAT-Python

[Figure: human_pbmc_10k_3pv3]

[Figure: human_breast_idc_7p5k_3pv31]

[Figure: mouse_brain_e18_10k_si_3pv31]

Key Metrics Comparison

[Figure: key metrics comparison]

Large-Scale Testing: Xenium Atera Dataset

The full FFPE Human Breast Cancer Xenium (Atera) dataset was subset to 50k cells, 100k cells, and the full dataset (~170k cells) to evaluate scalability.
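
The text does not say how the subsets were drawn. One reproducible way to build such subsets, shown here purely as an assumption, is a seeded random draw over cell barcodes (the helper name subsample_barcodes and the toy barcodes are hypothetical):

```python
import random


def subsample_barcodes(barcodes, n, seed=0):
    """Return n barcodes drawn without replacement, or all of them if n is too large."""
    barcodes = list(barcodes)
    if n >= len(barcodes):
        return barcodes
    rng = random.Random(seed)  # fixed seed so the same subset is drawn every time
    return rng.sample(barcodes, n)


# Toy barcodes standing in for the cells of a large dataset.
all_cells = [f"cell_{i}" for i in range(1_000)]
subset = subsample_barcodes(all_cells, 50, seed=42)
print(len(subset), len(set(subset)))
```

Fixing the seed means each benchmark size can be regenerated exactly, which keeps runtime comparisons between runs on the same cells.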

[Figure: Runtime Comparison]

CNV Heatmap with Annotation

[Figure: CNV heatmap with annotation]

[Figure: CNV comparison]

Why Results May Differ from CopyKAT-R

In the comparison of final predictions above, Seurat cluster 4 was called diploid by CopyKAT-R but aneuploid by CopyKAT-Python. In this case, the CopyKAT-Python call was confirmed as correct by the corresponding H&E cell morphology.

The key difference is in the final prediction step (step 8), where both implementations perform hierarchical clustering on the adjusted CNA matrix and cut the tree at k=2. R's copykat explicitly uses method = "ward.D" in hclust(), while CopyKAT-Python uses scipy/fastcluster's "ward", which implements the mathematically correct Ward criterion (R's ward.D2). For cell clusters with subtle CNV profiles that sit near the boundary of the diploid/aneuploid split (such as Seurat cluster 4 here), the two linkage variants can produce different dendrogram topologies, causing the binary label assignment to flip.
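
The two linkage variants differ only in what the Ward update is applied to: "ward.D" applies the Lance-Williams update to the distances as given, while "ward.D2" applies it to squared distances. The pure-Python sketch below illustrates that difference on toy data. It is a simplified illustration, not the code of R's hclust, scipy, fastcluster, or CopyKAT-Python, and the function name ward_cut is hypothetical.

```python
from itertools import combinations


def ward_cut(points, squared, k=2):
    """Agglomerate with the Ward Lance-Williams update and cut at k clusters.

    squared=False mimics "ward.D" (update applied to raw Euclidean distances);
    squared=True mimics "ward.D2" (update applied to squared distances).
    """
    n = len(points)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member indices
    dist = {}
    for i, j in combinations(range(n), 2):
        d = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
        dist[(i, j)] = d * d if squared else d

    next_id = n
    while len(clusters) > k:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        d_ij = dist[(i, j)]
        ni, nj = len(clusters[i]), len(clusters[j])
        merged = {}
        for c, members in clusters.items():
            if c in (i, j):
                continue
            nc = len(members)
            d_ci = dist[(min(c, i), max(c, i))]
            d_cj = dist[(min(c, j), max(c, j))]
            # Lance-Williams update for Ward linkage.
            merged[(c, next_id)] = (
                (ni + nc) * d_ci + (nj + nc) * d_cj - nc * d_ij
            ) / (ni + nj + nc)
        # Drop distances involving the two merged clusters, then add the new ones.
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        dist.update(merged)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1

    labels = [0] * n
    for label, members in enumerate(clusters.values()):
        for m in members:
            labels[m] = label
    return labels


# Well-separated toy data: both variants should agree on the two groups.
toy = [(0.0,), (0.1,), (0.2,), (10.0,), (10.1,)]
labels_d = ward_cut(toy, squared=False)
labels_d2 = ward_cut(toy, squared=True)
print(labels_d, labels_d2)
```

On clearly separated data like this toy set the two variants give the same split; they can diverge only when groups sit close to the cut, which matches the boundary case described above.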

CopyKAT-Python results may not be identical to CopyKAT-R due to differences in:

  • Gene annotation versions
  • Filtering and preprocessing steps
  • Numerical implementation details
  • Smoothing and segmentation algorithms
  • Clustering behavior (parDist + hclust vs. PCA + fastcluster)

High-confidence results typically show:

  • Clear chromosome-arm or whole-chromosome CNV patterns
  • Consistent CNV profiles within clusters
  • Strong separation between inferred diploid and aneuploid cells

Lower-confidence results may occur in samples with:

  • Weak CNV signal or low sequencing depth
  • Few normal reference cells
  • Strong batch effects
  • Near-diploid tumor genomes

Disclaimer: CopyKAT-Python is an independent reimplementation focused on scalability and usability, while faithfully reproducing the core CopyKAT analytical strategy.