cNF dataset and Phosphosite additions #470
Open
jjacobson95 wants to merge 3 commits into
Adds the cNF (cutaneous neurofibroma) organoid drug screen as a new coderdata dataset, along with a new global phosphosites reference set used by the cNF omics pipeline.
Resolves #469
New dataset: cnf
Drug-response measurements for 238 compounds across cNF patient-derived organoids from 10 patients (23 tumor specimens), with matched RNA-seq, global proteomics, and phosphoproteomics.
cnf_samples.csv - specimens sourced from four Synapse origins (drug screen index, proteomics matrix, RNA discovery file, Normal Skin folder); all specimen strings canonicalized through cnf_utils.classify_specimen; sample types: patient derived organoid, tumor, normal_tissue
cnf_transcriptomics.csv - gene-level TPM from per-cohort salmon matrices (syn66352931, syn70765053); six protocol-optimization columns excluded
cnf_proteomics.csv - batch-corrected global proteomics (syn74815895, correctedAbundance)
cnf_phosphoproteomics.csv - batch-corrected phosphoproteomics (syn70078415) mapped to phosphosite_id via phosphosites.csv; 2,974/2,986 unique sites matched (99.6%); 12 permanently unmatched due to missing entrez_id in genes.csv (BAP18, C11orf96, C14orf93, C1orf21, C2orf49, CAST, PAXX, SMAP, and four others)
cnf_experiments.tsv - multi-dose drugs run through fit_curve.py (fit_auc, fit_ic50, etc.); single-dose drugs recorded as dose_response_metric = uM_viability, value = viability fraction at 1 uM
cnf_drugs.tsv / cnf_drug_descriptors.tsv.gz

NF0021_T1_Onalespid_1uM viability data is also attributed to NF0021_T1 via SPECIMEN_DUAL_MAPPINGS.

New reference set: phosphosites
Produces phosphosites.csv (~126,150 sites) from three priority-ordered sources, including syn70078415 (~188 additional sites, from cNF raw phospho data).
phosphosite_id values are stable across builds via --prev. The genes step must complete before phosphosites runs; both build_dataset.py and build_all.py enforce this sequencing.

Schema changes
Adds the Phosphosite class and phosphosite_id slot
Adds the phosphoproteomics slot to the omics data model
Adds uM_viability to the ResponseMetric enum (single-dose viability fraction at 1 uM, range 0-1)
Updates expected_files.yaml with cnf and phosphosites output files

CI
Adds build-cnf and build-phosphosites jobs to .github/workflows/build.yml

Genes
Previously, the genes build failed intermittently and crashed the build process.
00-buildGeneFile.R now retries the biomaRt Ensembl call across all four mirrors (www, useast, asia, uswest) with exponential backoff (30s base, doubling each cycle, up to 5 cycles) instead of failing on the first network error. Progress messages and logging are now included as well.
Fixes the joined.df pipeline, which silently dropped gene_symbol
Adds by= arguments to all joins to avoid a deprecation warning

Note: the full build process has not been run with these changes. @sgosline, do you think we should do a new full build? There have been a number of changes since the last one, so some debugging will likely be required.
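
For reviewers who want to reason about the mirror-retry behaviour described under Genes without reading the R code, here is a minimal Python sketch of the same idea. It is not the code in 00-buildGeneFile.R: the function name, the `fetch` callback, and the broad exception handling are hypothetical, and it assumes the delay is applied once per full pass over the mirrors ("doubling each cycle"), which is my reading of the description.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")

# Mirror names as listed in the PR description.
MIRRORS = ("www", "useast", "asia", "uswest")


def fetch_with_mirror_retry(
    fetch: Callable[[str], T],
    mirrors: Sequence[str] = MIRRORS,
    base_delay: float = 30.0,
    max_cycles: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each mirror in turn; after a full failed cycle, wait, then double the delay."""
    delay = base_delay
    last_error: Optional[Exception] = None
    for cycle in range(1, max_cycles + 1):
        for mirror in mirrors:
            try:
                return fetch(mirror)
            except Exception as err:  # assumption: any error counts as a failed attempt
                last_error = err
                print(f"cycle {cycle}: mirror {mirror!r} failed: {err}")
        if cycle < max_cycles:
            print(f"all mirrors failed; waiting {delay:.0f}s before cycle {cycle + 1}")
            sleep(delay)
            delay *= 2  # exponential backoff between cycles
    raise RuntimeError(f"all mirrors failed after {max_cycles} cycles") from last_error


if __name__ == "__main__":
    # Simulated outage: only the third mirror responds. No real network call is made.
    def flaky_fetch(mirror: str) -> str:
        if mirror != "asia":
            raise ConnectionError("simulated outage")
        return f"gene table from {mirror}"

    print(fetch_with_mirror_retry(flaky_fetch, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the sketch testable without waiting; with the defaults, four sleeps separate the five cycles, for a worst-case wait of 30 + 60 + 120 + 240 = 450 seconds before the build gives up.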
Possible next steps: add more phosphosite data, starting with beataml and cell lines.
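
If more phosphosite sources are added, the match-rate check reported for cNF (2,974/2,986 unique sites) would be worth repeating for each of them. The sketch below shows one way to do that in Python. The `(gene_symbol, site)` key, the function names, and the toy data are all hypothetical; the actual columns used to join against phosphosites.csv are not stated in this description.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical join key: (gene_symbol, site label). The real key may differ.
SiteKey = Tuple[str, str]


def match_sites(
    observed: Iterable[SiteKey],
    reference: Dict[SiteKey, int],
) -> Tuple[Dict[SiteKey, int], List[SiteKey]]:
    """Split the unique observed sites into matched (with an id) and unmatched."""
    matched: Dict[SiteKey, int] = {}
    unmatched: List[SiteKey] = []
    for key in sorted(set(observed)):
        if key in reference:
            matched[key] = reference[key]
        else:
            unmatched.append(key)
    return matched, unmatched


def match_rate(n_matched: int, n_total: int) -> float:
    """Percentage of unique sites that received an id."""
    return 100.0 * n_matched / n_total if n_total else 0.0


if __name__ == "__main__":
    # Toy data with placeholder names; not taken from any real dataset.
    reference = {("GENE_A", "S10"): 1, ("GENE_B", "T22"): 2}
    observed = [("GENE_A", "S10"), ("GENE_A", "S10"), ("GENE_C", "Y5")]
    matched, unmatched = match_sites(observed, reference)
    total = len(matched) + len(unmatched)
    print(f"{len(matched)}/{total} unique sites matched "
          f"({match_rate(len(matched), total):.1f}%)")
    print("unmatched:", unmatched)
```

Reporting the unmatched keys alongside the rate makes it easy to tell whether failures trace back to genes missing from genes.csv, as with the 12 unmatched cNF sites.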