
cNF dataset and Phosphosite additions #470

Open
jjacobson95 wants to merge 3 commits into main from cnf

@jjacobson95
Collaborator

Adds the cNF (cutaneous neurofibroma) organoid drug screen as a new coderdata dataset, along with a new global phosphosites reference set used by the cNF omics pipeline.

Resolves #469

New dataset: cnf

Drug-response measurements for 238 compounds across cNF patient-derived organoids from 10 patients (23 tumor specimens), with matched RNA-seq, global proteomics, and phosphoproteomics.

  • cnf_samples.csv - specimens sourced from four Synapse origins (drug screen index, proteomics matrix, RNA discovery file, Normal Skin folder); all specimen strings canonicalized through cnf_utils.classify_specimen; sample types: patient derived organoid, tumor, normal_tissue
  • cnf_transcriptomics.csv - gene-level TPM from per-cohort salmon matrices (syn66352931, syn70765053); six protocol-optimization columns excluded
  • cnf_proteomics.csv - batch-corrected global proteomics (syn74815895, correctedAbundance)
  • cnf_phosphoproteomics.csv - batch-corrected phosphoproteomics (syn70078415) mapped to phosphosite_id via phosphosites.csv; 2,974/2,986 unique sites matched (99.6%); 12 permanently unmatched due to missing entrez_id in genes.csv (BAP18, C11orf96, C14orf93, C1orf21, C2orf49, CAST, PAXX, SMAP, and four others)
  • cnf_experiments.tsv - multi-dose drugs run through fit_curve.py (fit_auc, fit_ic50, etc.); single-dose drugs recorded as dose_response_metric = uM_viability, value = viability fraction at 1 uM
  • cnf_drugs.tsv / cnf_drug_descriptors.tsv.gz
  • One known dual-attribution case: NF0021_T1_Onalespid_1uM viability data is also attributed to NF0021_T1 via SPECIMEN_DUAL_MAPPINGS
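The phosphosite mapping and its match rate can be illustrated with a minimal sketch. It assumes a plain dictionary lookup keyed by a site string; the key format, the toy site names, and the helpers `map_sites` and `match_rate` are hypothetical and are not taken from the cnf build scripts.

```python
def map_sites(measured, reference):
    """Attach a phosphosite_id to each measured site and collect the misses.

    `reference` maps a site key to its phosphosite_id (hypothetical layout).
    """
    matched, unmatched = {}, []
    for site in measured:
        site_id = reference.get(site)
        if site_id is None:
            unmatched.append(site)
        else:
            matched[site] = site_id
    return matched, unmatched


def match_rate(n_matched, n_total):
    """Percentage of sites matched, rounded to one decimal place."""
    return round(100.0 * n_matched / n_total, 1)


# Toy inputs, made up for illustration only.
reference = {"GENE_A_S10": 1, "GENE_B_T25": 2}
measured = ["GENE_A_S10", "GENE_B_T25", "GENE_C_Y7"]

matched, unmatched = map_sites(measured, reference)
print(matched)    # sites that received a phosphosite_id
print(unmatched)  # sites with no reference entry

# The counts reported above: 2,974 of 2,986 unique sites.
print(match_rate(2974, 2986))
```

Keeping the unmatched sites in a separate list makes it easy to report them by name, as the bullet above does for the 12 sites without an entrez_id.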

New reference set: phosphosites

Produces phosphosites.csv (~126,150 sites) from three priority-ordered sources:

  1. Ochoa et al. (2020) Nat Biotechnol 38:365-373 Supplementary Table S3 (~116k sites, Springer CDN)
  2. UniProt PTM REST API (~12k additional)
  3. Synapse supplement syn70078415 (~188 additional, from cNF raw phospho data)

phosphosite_id values are kept stable across builds via --prev. The genes step must complete before the phosphosites step runs; both build_dataset.py and build_all.py enforce this ordering.
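One way to picture the priority-ordered merge and the stable IDs is the sketch below. It is an assumption-laden illustration: integer IDs, string site keys, and the helpers `merge_sources` and `assign_ids` are all hypothetical, and the actual behavior of --prev may differ.

```python
def merge_sources(sources):
    """Merge site keys from priority-ordered sources.

    `sources` is a list of (source_name, site_keys) pairs, highest priority
    first. A site already provided by an earlier source keeps that source.
    """
    merged = {}
    for source_name, site_keys in sources:
        for key in site_keys:
            merged.setdefault(key, source_name)
    return merged


def assign_ids(site_keys, prev_ids):
    """Reuse IDs from a previous build; give unseen sites the next free integers."""
    next_id = max(prev_ids.values(), default=0) + 1
    ids = {}
    for key in site_keys:
        if key in prev_ids:
            ids[key] = prev_ids[key]
        else:
            ids[key] = next_id
            next_id += 1
    return ids


# Toy inputs, made up for illustration only.
sources = [
    ("ochoa", ["A_S1", "B_T2"]),
    ("uniprot", ["B_T2", "C_Y3"]),
    ("synapse", ["D_S4"]),
]
prev_ids = {"A_S1": 1, "C_Y3": 2}  # IDs carried over from an earlier build

merged = merge_sources(sources)
ids = assign_ids(merged, prev_ids)
print(merged)
print(ids)
```

Sites seen in the previous build keep their IDs, so downstream files that reference phosphosite_id do not need to be rewritten when new sites are appended.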

Schema changes

  • Added Phosphosite class and phosphosite_id slot
  • Added phosphoproteomics slot to the omics data model
  • Added uM_viability to the ResponseMetric enum (single-dose viability fraction at 1 uM, range 0-1)
  • Updated expected_files.yaml with cnf and phosphosites output files
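As a rough illustration of the new metric, the sketch below mirrors the enum in Python and checks the 0-1 range for uM_viability. The schema file itself is not shown here, so the class layout and the `validate_response` helper are hypothetical; only the metric names fit_auc, fit_ic50, and uM_viability come from this PR.

```python
from enum import Enum


class ResponseMetric(Enum):
    """Python mirror of a subset of the response metrics (illustrative only)."""
    FIT_AUC = "fit_auc"
    FIT_IC50 = "fit_ic50"
    UM_VIABILITY = "uM_viability"


def validate_response(metric, value):
    """Return (ResponseMetric, value), rejecting unknown metrics and
    uM_viability values outside the 0-1 viability-fraction range."""
    parsed = ResponseMetric(metric)  # raises ValueError for unknown names
    if parsed is ResponseMetric.UM_VIABILITY and not 0.0 <= value <= 1.0:
        raise ValueError(f"uM_viability must be in [0, 1], got {value}")
    return parsed, value


print(validate_response("uM_viability", 0.42))
```

A check like this would catch viability values accidentally recorded as percentages rather than fractions.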

CI

  • Added build-cnf and build-phosphosites jobs to .github/workflows/build.yml

Genes

Previously, the genes step failed intermittently and crashed the build process.

  • 00-buildGeneFile.R now retries the biomaRt Ensembl call across all four mirrors (www, useast, asia, uswest) with exponential backoff (30s base, doubling each cycle, up to 5 cycles) instead of failing on the first network error. It also logs progress messages.
  • Fixed a duplicate column rename in the joined.df pipeline that silently dropped the gene_symbol column
  • Added explicit by= arguments to all joins to avoid a deprecation warning
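The retry schedule described above can be sketched as follows. This is a Python illustration of the logic only, not the R code in 00-buildGeneFile.R; the function names, the injectable `fetch` and `sleep` parameters, and the choice to skip the wait after the final cycle are assumptions.

```python
import time

MIRRORS = ("www", "useast", "asia", "uswest")


def backoff_delays(base=30, cycles=5):
    """Delay in seconds for each cycle: base, then doubling."""
    return [base * 2 ** i for i in range(cycles)]


def fetch_with_retry(fetch, mirrors=MIRRORS, base=30, cycles=5, sleep=time.sleep):
    """Try every mirror once per cycle, waiting longer between cycles.

    `fetch` is any callable taking a mirror name; network failures are
    expected to surface as OSError (ConnectionError and TimeoutError
    are subclasses).
    """
    last_error = None
    for cycle, delay in enumerate(backoff_delays(base, cycles), start=1):
        for mirror in mirrors:
            try:
                return fetch(mirror)
            except OSError as err:
                last_error = err
                print(f"cycle {cycle}: mirror {mirror} failed: {err}")
        if cycle < cycles:  # assumption: no wait after the final cycle
            sleep(delay)
    raise RuntimeError(f"all mirrors failed after {cycles} cycles") from last_error


print(backoff_delays())
```

Passing `sleep` as a parameter lets the schedule be exercised in a test without actually waiting.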

Note: the full build process has not yet been run with these changes. @sgosline do you think we should do a new full build? There have been a number of changes since the last one, so some debugging will likely be required.

Possible next steps: Add more phosphosite data. This could include beataml and cell lines to start.

jjacobson95 requested a review from sgosline on May 14, 2026 21:35
