cNF dataset and Phosphosite additions by jjacobson95 · Pull Request #470 · PNNL-CompBio/coderdata

jjacobson95 · 2026-05-14T21:34:30Z

Adds the cNF (cutaneous neurofibroma) organoid drug screen as a new coderdata dataset, along with a new global phosphosites reference set used by the cNF omics pipeline.

Resolves #469

New dataset: cnf

Drug-response measurements for 238 compounds across cNF patient-derived organoids from 10 patients (23 tumor specimens), with matched RNA-seq, global proteomics, and phosphoproteomics.

cnf_samples.csv - specimens sourced from four Synapse origins (drug screen index, proteomics matrix, RNA discovery file, Normal Skin folder); all specimen strings canonicalized through cnf_utils.classify_specimen; sample types: patient derived organoid, tumor, normal_tissue
cnf_transcriptomics.csv - gene-level TPM from per-cohort salmon matrices (syn66352931, syn70765053); six protocol-optimization columns excluded
cnf_proteomics.csv - batch-corrected global proteomics (syn74815895, correctedAbundance)
cnf_phosphoproteomics.csv - batch-corrected phosphoproteomics (syn70078415) mapped to phosphosite_id via phosphosites.csv; 2,974/2,986 unique sites matched (99.6%); 12 permanently unmatched due to missing entrez_id in genes.csv (BAP18, C11orf96, C14orf93, C1orf21, C2orf49, CAST, PAXX, SMAP, and four others)
cnf_experiments.tsv - multi-dose drugs run through fit_curve.py (fit_auc, fit_ic50, etc.); single-dose drugs recorded as dose_response_metric = uM_viability, value = viability fraction at 1 uM
cnf_drugs.tsv / cnf_drug_descriptors.tsv.gz
One known dual-attribution case: NF0021_T1_Onalespid_1uM viability data is also attributed to NF0021_T1 via SPECIMEN_DUAL_MAPPINGS
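
The phosphosite mapping and match count described for cnf_phosphoproteomics.csv can be sketched as below. This is a minimal illustration, not the pipeline's code: the (gene_symbol, site) join key, the `map_sites` helper, and all example rows are hypothetical, and the real column names in phosphosites.csv may differ.

```python
# Hypothetical reference rows standing in for phosphosites.csv:
# (gene_symbol, site) -> phosphosite_id. The real join key is an assumption.
reference = {
    ("GENE_A", "S10"): 1,
    ("GENE_A", "T25"): 2,
    ("GENE_B", "Y7"): 3,
}

# Hypothetical measured sites; GENE_C has no reference entry.
measured = [
    {"gene_symbol": "GENE_A", "site": "S10", "abundance": 0.42},
    {"gene_symbol": "GENE_B", "site": "Y7", "abundance": -1.10},
    {"gene_symbol": "GENE_C", "site": "S3", "abundance": 0.07},
]


def map_sites(measured, reference):
    """Attach phosphosite_id where a reference entry exists; collect the rest."""
    matched, unmatched = [], []
    for row in measured:
        key = (row["gene_symbol"], row["site"])
        if key in reference:
            matched.append({**row, "phosphosite_id": reference[key]})
        else:
            unmatched.append(row)
    return matched, unmatched


matched, unmatched = map_sites(measured, reference)
match_rate = len(matched) / len(measured)
print(f"{len(matched)}/{len(measured)} sites matched ({match_rate:.1%})")
print("unmatched genes:", sorted({row["gene_symbol"] for row in unmatched}))
```

Keeping the unmatched rows as a separate list, rather than dropping them silently, makes it easy to report which genes are missing from the reference.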

New reference set: phosphosites

Produces phosphosites.csv (~126,150 sites) from three priority-ordered sources:

Ochoa et al. (2020) Nat Biotechnol 38:365-373 Supplementary Table S3 (~116k sites, Springer CDN)
UniProt PTM REST API (~12k additional)
Synapse supplement syn70078415 (~188 additional, from cNF raw phospho data)

phosphosite_id values are kept stable across builds via the --prev option. The genes step must complete before the phosphosites step runs; both build_dataset.py and build_all.py enforce this ordering.
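
One way to picture the priority-ordered merge and the stable IDs is the sketch below. It assumes that --prev supplies the previous build's site-to-ID mapping and that a site is keyed by (gene_symbol, site); neither detail is confirmed here, and `build_phosphosites` and the source labels are hypothetical names used only for illustration.

```python
def build_phosphosites(sources, prev_ids=None):
    """Merge sources in priority order and assign stable phosphosite_id values.

    sources:  list of (source_name, [site_key, ...]), highest priority first.
    prev_ids: mapping site_key -> phosphosite_id from a previous build
              (assumed to be what --prev provides).
    """
    prev_ids = dict(prev_ids or {})
    # New IDs start above every ID that has already been issued.
    next_id = max(prev_ids.values(), default=0) + 1
    result = {}
    for source_name, keys in sources:
        for key in keys:
            if key in result:
                continue  # a higher-priority source already contributed this site
            if key in prev_ids:
                site_id = prev_ids[key]  # reuse the ID from the previous build
            else:
                site_id = next_id
                next_id += 1
            result[key] = {"phosphosite_id": site_id, "source": source_name}
    return result


# First build: two hypothetical sources with one overlapping site.
first = build_phosphosites([
    ("ochoa", [("GENE_A", "S10"), ("GENE_B", "Y7")]),
    ("uniprot", [("GENE_A", "S10"), ("GENE_C", "T3")]),
])

# Second build: a new site appears first, but existing IDs do not shift.
prev = {key: row["phosphosite_id"] for key, row in first.items()}
second = build_phosphosites([
    ("ochoa", [("GENE_D", "S1"), ("GENE_A", "S10"), ("GENE_B", "Y7")]),
    ("uniprot", [("GENE_A", "S10"), ("GENE_C", "T3")]),
], prev_ids=prev)
```

Because new IDs are always allocated above the previous maximum, adding or reordering sites in a later build never renumbers a site that already had an ID.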

Schema changes

Added Phosphosite class and phosphosite_id slot
Added phosphoproteomics slot to the omics data model
Added uM_viability to the ResponseMetric enum (single-dose viability fraction at 1 uM, range 0-1)
Updated expected_files.yaml with cnf and phosphosites output files
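
The range constraint on uM_viability can be illustrated with a small check. This is only a sketch: the Python Enum stands in for the schema's ResponseMetric enum and lists just the metrics named in this description, and `check_response` is a hypothetical helper, not part of coderdata.

```python
from enum import Enum


class ResponseMetric(str, Enum):
    """Stand-in for the schema enum; only metrics named in this PR are listed."""
    FIT_AUC = "fit_auc"
    FIT_IC50 = "fit_ic50"
    UM_VIABILITY = "uM_viability"


def check_response(metric, value):
    """Validate one response record and return it as a dict."""
    metric = ResponseMetric(metric)  # raises ValueError for an unknown metric
    if metric is ResponseMetric.UM_VIABILITY and not 0.0 <= value <= 1.0:
        raise ValueError(f"uM_viability must be a fraction in [0, 1], got {value}")
    return {"dose_response_metric": metric.value, "value": value}


record = check_response("uM_viability", 0.37)
```

A check like this would catch viability values accidentally recorded as percentages (for example 37 instead of 0.37).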

CI

Added build-cnf and build-phosphosites jobs to .github/workflows/build.yml

Genes

Previously, the genes build step failed intermittently, which crashed the build process.

00-buildGeneFile.R now retries the biomaRt Ensembl call across all four mirrors (www, useast, asia, uswest) with exponential backoff (30s base, doubling each cycle, up to 5 cycles) instead of failing on the first network error. It also logs progress messages now.
Fixed a duplicate column rename in the joined.df pipeline that silently dropped the gene_symbol column
Added explicit by= arguments to all joins to avoid deprecation warnings
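
The retry schedule described for 00-buildGeneFile.R can be sketched as follows. The actual change is in R with biomaRt; this Python version only illustrates the control flow under stated assumptions: each cycle tries every mirror once, the wait between cycles is 30s doubled per cycle, and there is no wait after the final cycle. `fetch_with_backoff` and the `fetch` callable are hypothetical.

```python
import time

MIRRORS = ("www", "useast", "asia", "uswest")


def fetch_with_backoff(fetch, mirrors=MIRRORS, base_delay=30.0,
                       max_cycles=5, sleep=time.sleep, log=print):
    """Call fetch(mirror) on each mirror in turn, backing off between cycles.

    fetch is a caller-supplied function that raises on a network error.
    sleep and log are injectable so the schedule can be tested without waiting.
    """
    last_error = None
    for cycle in range(max_cycles):
        for mirror in mirrors:
            try:
                return fetch(mirror)
            except Exception as err:  # broad on purpose: any failure means "try the next mirror"
                last_error = err
                log(f"cycle {cycle + 1}/{max_cycles}: mirror '{mirror}' failed: {err}")
        if cycle < max_cycles - 1:
            delay = base_delay * 2 ** cycle  # 30s, 60s, 120s, 240s
            log(f"all mirrors failed; retrying in {delay:.0f}s")
            sleep(delay)
    raise RuntimeError(f"all mirrors failed after {max_cycles} cycles") from last_error
```

Passing `sleep` and `log` as parameters keeps the function easy to exercise in a test, for example with `sleep=delays.append` to record the schedule instead of waiting.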

Note: the full build process has not been run with these changes. @sgosline, do you think we should do a new full build? There have been a number of changes since the last one, so some debugging will likely be required.

Possible next steps: add more phosphosite data, starting with beataml and cell lines.

jjacobson95 added 3 commits April 27, 2026 19:26

initial pass. Not very close to working yet

36ed363

All CNF and phosphosite code

d79dcaf

All should be ready, these are just some small changes i forgot to add in.

a378a11

jjacobson95 requested a review from sgosline May 14, 2026 21:35

jjacobson95 wants to merge 3 commits into main from cnf
