From 5510e6dc6d1f80bec957ee4d4e3d7bc92310defc Mon Sep 17 00:00:00 2001
From: Tony Wu
Date: Fri, 22 May 2026 12:13:01 -0400
Subject: [PATCH 1/2] docs(fractions): Update documentation w.r.t. fraction handling

---
 DESCRIPTION                                |  2 +-
 R/utils_documentation.R                    |  9 ++++-
 man/DIAUmpiretoMSstatsFormat.Rd            |  9 ++++-
 man/FragPipetoMSstatsFormat.Rd             |  9 ++++-
 man/MSstatsConvert.Rd                      |  1 +
 man/MaxQtoMSstatsFormat.Rd                 |  9 ++++-
 man/MaxQtoMSstatsTMTFormat.Rd              |  9 ++++-
 man/MetamorpheusToMSstatsFormat.Rd         |  9 ++++-
 man/OpenMStoMSstatsFormat.Rd               |  9 ++++-
 man/OpenSWATHtoMSstatsFormat.Rd            |  9 ++++-
 man/PDtoMSstatsFormat.Rd                   |  9 ++++-
 man/PDtoMSstatsTMTFormat.Rd                |  9 ++++-
 man/PhilosophertoMSstatsTMTFormat.Rd       |  9 ++++-
 man/ProgenesistoMSstatsFormat.Rd           |  9 ++++-
 man/ProteinProspectortoMSstatsTMTFormat.Rd |  9 ++++-
 man/SpectroMinetoMSstatsTMTFormat.Rd       |  9 ++++-
 man/SpectronauttoMSstatsFormat.Rd          |  9 ++++-
 man/dot-getFullDesign.Rd                   |  8 ++---
 man/dot-sharedParametersAmongConverters.Rd |  9 ++++-
 vignettes/msstats_data_format.Rmd          | 38 ++++++++++++++++++----
 20 files changed, 165 insertions(+), 28 deletions(-)

diff --git a/DESCRIPTION b/DESCRIPTION
index af973548e..fbbfc59c3 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -14,7 +14,6 @@ License: Artistic-2.0
 Encoding: UTF-8
 LazyData: true
 Roxygen: list(markdown = TRUE)
-RoxygenNote: 7.3.3
 biocViews: MassSpectrometry, Proteomics, Software, DataImport, QualityControl
 Depends:
     R (>= 4.0)
@@ -82,3 +81,4 @@ Collate:
     'utils_logging.R'
     'utils_shared_peptides.R'
 VignetteBuilder: knitr
+Config/roxygen2/version: 8.0.0
diff --git a/R/utils_documentation.R b/R/utils_documentation.R
index 3bf208f71..9e0a8b7ae 100644
--- a/R/utils_documentation.R
+++ b/R/utils_documentation.R
@@ -5,7 +5,14 @@
 #' @param removeFewMeasurements TRUE (default) will remove the features that have 1 or 2 measurements across runs.
 #' @param useUniquePeptide TRUE (default) removes peptides that are assigned for more than one proteins.
 #' We assume to use unique peptide for each protein.
-#' @param summaryforMultipleRows max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities. Default is max for label-free converters and sum for TMT converters.
+#' @param summaryforMultipleRows max or sum - when multiple PSMs identify
+#' the same feature within a single MS run (duplicate PSMs), use the
+#' highest (max) or sum of the duplicate intensities. Default is max for
+#' label-free converters and sum for TMT converters. This parameter does
+#' not control collapsing across fractions of the same biological mixture;
+#' fraction handling is performed separately by `MSstatsBalancedDesign()`
+#' (see the "Fractions and balanced design" section of the data format
+#' vignette).
 #' @param removeProtein_with1Feature TRUE will remove the proteins which have only 1 feature, which is the combination of peptide, precursor charge, fragment and charge. FALSE is default.
 #' @param removeProtein_with1Peptide TRUE will remove the proteins which have only 1 peptide and charge. FALSE is default.
 #' @param removeOxidationMpeptides TRUE will remove the peptides including 'oxidation (M)' in modification. FALSE is default.
diff --git a/man/DIAUmpiretoMSstatsFormat.Rd b/man/DIAUmpiretoMSstatsFormat.Rd
index 60ddb5623..aa04273d0 100644
--- a/man/DIAUmpiretoMSstatsFormat.Rd
+++ b/man/DIAUmpiretoMSstatsFormat.Rd
@@ -38,7 +38,14 @@ DIAUmpiretoMSstatsFormat(
 
 \item{removeProtein_with1Feature}{TRUE will remove the proteins which have only 1 feature, which is the combination of peptide, precursor charge, fragment and charge. FALSE is default.}
 
-\item{summaryforMultipleRows}{max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities.
Default is max for label-free converters and sum for TMT converters.}
+\item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
+the same feature within a single MS run (duplicate PSMs), use the
+highest (max) or sum of the duplicate intensities. Default is max for
+label-free converters and sum for TMT converters. This parameter does
+not control collapsing across fractions of the same biological mixture;
+fraction handling is performed separately by \code{MSstatsBalancedDesign()}
+(see the "Fractions and balanced design" section of the data format
+vignette).}
 
 \item{use_log_file}{logical. If TRUE, information about data processing
 will be saved to a file.}
diff --git a/man/FragPipetoMSstatsFormat.Rd b/man/FragPipetoMSstatsFormat.Rd
index f2b2e86f4..323007c90 100644
--- a/man/FragPipetoMSstatsFormat.Rd
+++ b/man/FragPipetoMSstatsFormat.Rd
@@ -27,7 +27,14 @@ We assume to use unique peptide for each protein.}
 
 \item{removeProtein_with1Feature}{TRUE will remove the proteins which have only 1 feature, which is the combination of peptide, precursor charge, fragment and charge. FALSE is default.}
 
-\item{summaryforMultipleRows}{max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities. Default is max for label-free converters and sum for TMT converters.}
+\item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
+the same feature within a single MS run (duplicate PSMs), use the
+highest (max) or sum of the duplicate intensities. Default is max for
+label-free converters and sum for TMT converters. This parameter does
+not control collapsing across fractions of the same biological mixture;
+fraction handling is performed separately by \code{MSstatsBalancedDesign()}
+(see the "Fractions and balanced design" section of the data format
+vignette).}
 
 \item{use_log_file}{logical.
If TRUE, information about data processing
 will be saved to a file.}
diff --git a/man/MSstatsConvert.Rd b/man/MSstatsConvert.Rd
index e87142964..1a2e9ca25 100644
--- a/man/MSstatsConvert.Rd
+++ b/man/MSstatsConvert.Rd
@@ -23,6 +23,7 @@ signal processing tools to a format suitable for statistical analysis with the M
 
 Authors:
 \itemize{
+  \item Anthony Wu \email{wu.anthon@northeastern.edu}
   \item Mateusz Staniak \email{mtst@mstaniak.pl}
   \item Devon Kohler \email{kohler.d@northeastern.edu}
   \item Meena Choi \email{mnchoi67@gmail.com}
diff --git a/man/MaxQtoMSstatsFormat.Rd b/man/MaxQtoMSstatsFormat.Rd
index dd22234e9..edf657b5b 100644
--- a/man/MaxQtoMSstatsFormat.Rd
+++ b/man/MaxQtoMSstatsFormat.Rd
@@ -34,7 +34,14 @@ MaxQtoMSstatsFormat(
 \item{useUniquePeptide}{TRUE (default) removes peptides that are assigned for more than one proteins.
 We assume to use unique peptide for each protein.}
 
-\item{summaryforMultipleRows}{max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities. Default is max for label-free converters and sum for TMT converters.}
+\item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
+the same feature within a single MS run (duplicate PSMs), use the
+highest (max) or sum of the duplicate intensities. Default is max for
+label-free converters and sum for TMT converters.
This parameter does
+not control collapsing across fractions of the same biological mixture;
+fraction handling is performed separately by \code{MSstatsBalancedDesign()}
+(see the "Fractions and balanced design" section of the data format
+vignette).}
 
 \item{removeFewMeasurements}{TRUE (default) will remove the features that have 1 or 2 measurements across runs.}
 
diff --git a/man/MaxQtoMSstatsTMTFormat.Rd b/man/MaxQtoMSstatsTMTFormat.Rd
index af761b6be..4f8d0801b 100644
--- a/man/MaxQtoMSstatsTMTFormat.Rd
+++ b/man/MaxQtoMSstatsTMTFormat.Rd
@@ -39,7 +39,14 @@ We assume to use unique peptide for each protein.}
 
 \item{rmProtein_with1Feature}{TRUE will remove the proteins which have only 1 peptide and charge. Default is FALSE.}
 
-\item{summaryforMultipleRows}{max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities. Default is max for label-free converters and sum for TMT converters.}
+\item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
+the same feature within a single MS run (duplicate PSMs), use the
+highest (max) or sum of the duplicate intensities. Default is max for
+label-free converters and sum for TMT converters. This parameter does
+not control collapsing across fractions of the same biological mixture;
+fraction handling is performed separately by \code{MSstatsBalancedDesign()}
+(see the "Fractions and balanced design" section of the data format
+vignette).}
 
 \item{use_log_file}{logical.
If TRUE, information about data processing
 will be saved to a file.}
diff --git a/man/MetamorpheusToMSstatsFormat.Rd b/man/MetamorpheusToMSstatsFormat.Rd
index 1e8851be7..cf0e643c7 100644
--- a/man/MetamorpheusToMSstatsFormat.Rd
+++ b/man/MetamorpheusToMSstatsFormat.Rd
@@ -36,7 +36,14 @@ We assume to use unique peptide for each protein.}
 
 \item{removeProtein_with1Feature}{TRUE will remove the proteins which have only 1 feature, which is the combination of peptide, precursor charge, fragment and charge. FALSE is default.}
 
-\item{summaryforMultipleRows}{max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities. Default is max for label-free converters and sum for TMT converters.}
+\item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
+the same feature within a single MS run (duplicate PSMs), use the
+highest (max) or sum of the duplicate intensities. Default is max for
+label-free converters and sum for TMT converters. This parameter does
+not control collapsing across fractions of the same biological mixture;
+fraction handling is performed separately by \code{MSstatsBalancedDesign()}
+(see the "Fractions and balanced design" section of the data format
+vignette).}
 
 \item{use_log_file}{logical. If TRUE, information about data processing
 will be saved to a file.}
diff --git a/man/OpenMStoMSstatsFormat.Rd b/man/OpenMStoMSstatsFormat.Rd
index 2f8c39710..ce894243e 100644
--- a/man/OpenMStoMSstatsFormat.Rd
+++ b/man/OpenMStoMSstatsFormat.Rd
@@ -31,7 +31,14 @@ We assume to use unique peptide for each protein.}
 
 \item{removeProtein_with1Feature}{TRUE will remove the proteins which have only 1 feature, which is the combination of peptide, precursor charge, fragment and charge. FALSE is default.}
 
-\item{summaryforMultipleRows}{max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities.
Default is max for label-free converters and sum for TMT converters.}
+\item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
+the same feature within a single MS run (duplicate PSMs), use the
+highest (max) or sum of the duplicate intensities. Default is max for
+label-free converters and sum for TMT converters. This parameter does
+not control collapsing across fractions of the same biological mixture;
+fraction handling is performed separately by \code{MSstatsBalancedDesign()}
+(see the "Fractions and balanced design" section of the data format
+vignette).}
 
 \item{use_log_file}{logical. If TRUE, information about data processing
 will be saved to a file.}
diff --git a/man/OpenSWATHtoMSstatsFormat.Rd b/man/OpenSWATHtoMSstatsFormat.Rd
index 74ed2a3e2..05c9a4f4b 100644
--- a/man/OpenSWATHtoMSstatsFormat.Rd
+++ b/man/OpenSWATHtoMSstatsFormat.Rd
@@ -37,7 +37,14 @@ We assume to use unique peptide for each protein.}
 
 \item{removeProtein_with1Feature}{TRUE will remove the proteins which have only 1 feature, which is the combination of peptide, precursor charge, fragment and charge. FALSE is default.}
 
-\item{summaryforMultipleRows}{max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities. Default is max for label-free converters and sum for TMT converters.}
+\item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
+the same feature within a single MS run (duplicate PSMs), use the
+highest (max) or sum of the duplicate intensities. Default is max for
+label-free converters and sum for TMT converters. This parameter does
+not control collapsing across fractions of the same biological mixture;
+fraction handling is performed separately by \code{MSstatsBalancedDesign()}
+(see the "Fractions and balanced design" section of the data format
+vignette).}
 
 \item{use_log_file}{logical.
If TRUE, information about data processing
 will be saved to a file.}
diff --git a/man/PDtoMSstatsFormat.Rd b/man/PDtoMSstatsFormat.Rd
index 40dbe8c05..ea1b6dc71 100644
--- a/man/PDtoMSstatsFormat.Rd
+++ b/man/PDtoMSstatsFormat.Rd
@@ -34,7 +34,14 @@ Run information. 'Run' will be matched with 'Spectrum.File'.}
 \item{useUniquePeptide}{TRUE (default) removes peptides that are assigned for more than one proteins.
 We assume to use unique peptide for each protein.}
 
-\item{summaryforMultipleRows}{max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities. Default is max for label-free converters and sum for TMT converters.}
+\item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
+the same feature within a single MS run (duplicate PSMs), use the
+highest (max) or sum of the duplicate intensities. Default is max for
+label-free converters and sum for TMT converters. This parameter does
+not control collapsing across fractions of the same biological mixture;
+fraction handling is performed separately by \code{MSstatsBalancedDesign()}
+(see the "Fractions and balanced design" section of the data format
+vignette).}
 
 \item{removeFewMeasurements}{TRUE (default) will remove the features that have 1 or 2 measurements across runs.}
 
diff --git a/man/PDtoMSstatsTMTFormat.Rd b/man/PDtoMSstatsTMTFormat.Rd
index 46a4cc561..688c8e0fd 100644
--- a/man/PDtoMSstatsTMTFormat.Rd
+++ b/man/PDtoMSstatsTMTFormat.Rd
@@ -37,7 +37,14 @@ We assume to use unique peptide for each protein.}
 
 \item{rmProtein_with1Feature}{TRUE will remove the proteins which have only 1 peptide and charge. Default is FALSE.}
 
-\item{summaryforMultipleRows}{max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities.
Default is max for label-free converters and sum for TMT converters.}
+\item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
+the same feature within a single MS run (duplicate PSMs), use the
+highest (max) or sum of the duplicate intensities. Default is max for
+label-free converters and sum for TMT converters. This parameter does
+not control collapsing across fractions of the same biological mixture;
+fraction handling is performed separately by \code{MSstatsBalancedDesign()}
+(see the "Fractions and balanced design" section of the data format
+vignette).}
 
 \item{use_log_file}{logical. If TRUE, information about data processing
 will be saved to a file.}
diff --git a/man/PhilosophertoMSstatsTMTFormat.Rd b/man/PhilosophertoMSstatsTMTFormat.Rd
index d2c8225a7..26900ea73 100644
--- a/man/PhilosophertoMSstatsTMTFormat.Rd
+++ b/man/PhilosophertoMSstatsTMTFormat.Rd
@@ -50,7 +50,14 @@ We assume to use unique peptide for each protein.}
 
 \item{rmProtein_with1Feature}{TRUE will remove the proteins which have only 1 peptide and charge. Default is FALSE.}
 
-\item{summaryforMultipleRows}{max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities. Default is max for label-free converters and sum for TMT converters.}
+\item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
+the same feature within a single MS run (duplicate PSMs), use the
+highest (max) or sum of the duplicate intensities. Default is max for
+label-free converters and sum for TMT converters. This parameter does
+not control collapsing across fractions of the same biological mixture;
+fraction handling is performed separately by \code{MSstatsBalancedDesign()}
+(see the "Fractions and balanced design" section of the data format
+vignette).}
 
 \item{use_log_file}{logical.
If TRUE, information about data processing
 will be saved to a file.}
diff --git a/man/ProgenesistoMSstatsFormat.Rd b/man/ProgenesistoMSstatsFormat.Rd
index 3d4cc7caf..220f38b19 100644
--- a/man/ProgenesistoMSstatsFormat.Rd
+++ b/man/ProgenesistoMSstatsFormat.Rd
@@ -27,7 +27,14 @@ ProgenesistoMSstatsFormat(
 \item{useUniquePeptide}{TRUE (default) removes peptides that are assigned for more than one proteins.
 We assume to use unique peptide for each protein.}
 
-\item{summaryforMultipleRows}{max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities. Default is max for label-free converters and sum for TMT converters.}
+\item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
+the same feature within a single MS run (duplicate PSMs), use the
+highest (max) or sum of the duplicate intensities. Default is max for
+label-free converters and sum for TMT converters. This parameter does
+not control collapsing across fractions of the same biological mixture;
+fraction handling is performed separately by \code{MSstatsBalancedDesign()}
+(see the "Fractions and balanced design" section of the data format
+vignette).}
 
 \item{removeFewMeasurements}{TRUE (default) will remove the features that have 1 or 2 measurements across runs.}
 
diff --git a/man/ProteinProspectortoMSstatsTMTFormat.Rd b/man/ProteinProspectortoMSstatsTMTFormat.Rd
index 9e9d49db2..ad0cbd9c9 100644
--- a/man/ProteinProspectortoMSstatsTMTFormat.Rd
+++ b/man/ProteinProspectortoMSstatsTMTFormat.Rd
@@ -32,7 +32,14 @@ We assume to use unique peptide for each protein.}
 
 \item{removeProtein_with1Feature}{TRUE will remove the proteins which have only 1 feature, which is the combination of peptide, precursor charge, fragment and charge. FALSE is default.}
 
-\item{summaryforMultipleRows}{max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities.
Default is max for label-free converters and sum for TMT converters.}
+\item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
+the same feature within a single MS run (duplicate PSMs), use the
+highest (max) or sum of the duplicate intensities. Default is max for
+label-free converters and sum for TMT converters. This parameter does
+not control collapsing across fractions of the same biological mixture;
+fraction handling is performed separately by \code{MSstatsBalancedDesign()}
+(see the "Fractions and balanced design" section of the data format
+vignette).}
 
 \item{use_log_file}{logical. If TRUE, information about data processing
 will be saved to a file.}
diff --git a/man/SpectroMinetoMSstatsTMTFormat.Rd b/man/SpectroMinetoMSstatsTMTFormat.Rd
index cca8cb167..e1f1757ea 100644
--- a/man/SpectroMinetoMSstatsTMTFormat.Rd
+++ b/man/SpectroMinetoMSstatsTMTFormat.Rd
@@ -36,7 +36,14 @@ We assume to use unique peptide for each protein.}
 
 \item{rmProtein_with1Feature}{TRUE will remove the proteins which have only 1 peptide and charge. Defaut is FALSE.}
 
-\item{summaryforMultipleRows}{max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities. Default is max for label-free converters and sum for TMT converters.}
+\item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
+the same feature within a single MS run (duplicate PSMs), use the
+highest (max) or sum of the duplicate intensities. Default is max for
+label-free converters and sum for TMT converters. This parameter does
+not control collapsing across fractions of the same biological mixture;
+fraction handling is performed separately by \code{MSstatsBalancedDesign()}
+(see the "Fractions and balanced design" section of the data format
+vignette).}
 
 \item{use_log_file}{logical.
If TRUE, information about data processing
 will be saved to a file.}
diff --git a/man/SpectronauttoMSstatsFormat.Rd b/man/SpectronauttoMSstatsFormat.Rd
index dea2e4b99..a32e21e50 100644
--- a/man/SpectronauttoMSstatsFormat.Rd
+++ b/man/SpectronauttoMSstatsFormat.Rd
@@ -75,7 +75,14 @@ We assume to use unique peptide for each protein.}
 
 \item{removeProtein_with1Feature}{TRUE will remove the proteins which have only 1 feature, which is the combination of peptide, precursor charge, fragment and charge. FALSE is default.}
 
-\item{summaryforMultipleRows}{max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities. Default is max for label-free converters and sum for TMT converters.}
+\item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
+the same feature within a single MS run (duplicate PSMs), use the
+highest (max) or sum of the duplicate intensities. Default is max for
+label-free converters and sum for TMT converters. This parameter does
+not control collapsing across fractions of the same biological mixture;
+fraction handling is performed separately by \code{MSstatsBalancedDesign()}
+(see the "Fractions and balanced design" section of the data format
+vignette).}
 
 \item{calculateAnomalyScores}{Default is FALSE. If TRUE, will run anomaly detection model and calculate anomaly scores for each feature.
Used downstream to weigh measurements in differential analysis.}
diff --git a/man/dot-getFullDesign.Rd b/man/dot-getFullDesign.Rd
index a9f4ecbfc..f7b2720c3 100644
--- a/man/dot-getFullDesign.Rd
+++ b/man/dot-getFullDesign.Rd
@@ -13,12 +13,12 @@
 \code{feature_col} and \code{measurement_col} will be created within each unique
 value of this column}
 
-\item{is_tmt}{if TRUE, data will be treated as coming from TMT experiment.}
-
-\item{`feature_col`}{name of the column that labels features}
+\item{feature_col}{name of the column that labels features}
 
-\item{`measurement_col`}{name of a column with measurement labels - Runs in
+\item{measurement_col}{name of a column with measurement labels - Runs in
 label-free case, Channels in TMT case.}
+
+\item{is_tmt}{if TRUE, data will be treated as coming from TMT experiment.}
 }
 \value{
 data.table
diff --git a/man/dot-sharedParametersAmongConverters.Rd b/man/dot-sharedParametersAmongConverters.Rd
index c9e7c6cd7..100a14053 100644
--- a/man/dot-sharedParametersAmongConverters.Rd
+++ b/man/dot-sharedParametersAmongConverters.Rd
@@ -12,7 +12,14 @@
 \item{useUniquePeptide}{TRUE (default) removes peptides that are assigned for more than one proteins.
 We assume to use unique peptide for each protein.}
 
-\item{summaryforMultipleRows}{max or sum - when there are multiple measurements for certain feature and certain run, use highest or sum of multiple intensities. Default is max for label-free converters and sum for TMT converters.}
+\item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
+the same feature within a single MS run (duplicate PSMs), use the
+highest (max) or sum of the duplicate intensities. Default is max for
+label-free converters and sum for TMT converters.
This parameter does
+not control collapsing across fractions of the same biological mixture;
+fraction handling is performed separately by \code{MSstatsBalancedDesign()}
+(see the "Fractions and balanced design" section of the data format
+vignette).}
 
 \item{removeProtein_with1Feature}{TRUE will remove the proteins which have only 1 feature, which is the combination of peptide, precursor charge, fragment and charge. FALSE is default.}
 
diff --git a/vignettes/msstats_data_format.Rmd b/vignettes/msstats_data_format.Rmd
index dbf77916a..b2ba9aef6 100644
--- a/vignettes/msstats_data_format.Rmd
+++ b/vignettes/msstats_data_format.Rmd
@@ -304,13 +304,37 @@ should consists of elements named
 
 # Fractions and balanced design
 
-Finally, after preprocessing, `MSstatsBalancedDesign` function can be applied to
-handle fractions and create balanced design.
-For label-free and SRM data, it means that fractionation or technical replicates will be detected if these information is not provided. Features measured in multiple fractions (overlapped) will be assigned to a unique fraction. Then, the data will be adjusted so that within each fraction, every feature has a row for certain run. If the intensity value is missing, it will be denoted by `NA`.
-
-For TMT data, a unique fraction will be selected for each overlapped feature and the
-data will adjusted so that within each run, every feature has a row for each channel.
-If the intensity is missing for a channel, it will be denoted by `NA`.
+Finally, after preprocessing, the `MSstatsBalancedDesign` function can be
+applied to handle fractions and create a balanced design.
+
+For label-free and SRM data, fractionation or technical replicates are
+detected if this information is not provided. Features that overlap
+across multiple fractions of the same sample are assigned to a single
+fraction by the following rule: for each feature, the fraction with the
+**largest number of MS runs containing a non-missing measurement** is
+kept.
If multiple fractions tie on that count, the tie is broken by
+choosing the fraction with the **highest mean intensity**. The
+remaining fractions' rows for that feature are dropped. The data are
+then adjusted so that within each fraction, every feature has a row
+for each run. If the intensity value is missing, it is denoted by `NA`.
+
+For TMT data, a unique fraction is selected for each overlapped feature
+using a different cascade of criteria, applied in order until the
+overlap is resolved: the fraction with the **highest mean intensity**
+across channels is tried first, then the **highest sum**, then the
+**highest single intensity** (`max`). If features still overlap after
+all three passes, their intensities are averaged across fractions as a
+final fallback. After fraction selection, the data are adjusted so
+that within each run, every feature has a row for each channel. If
+the intensity is missing for a channel, it is denoted by `NA`.
+
+These two rules differ by design: the label-free path optimizes for
+the fraction with the most observations of the feature, while the TMT
+path optimizes for the fraction with the strongest signal. Note also
+that this fraction-collapsing logic is distinct from the
+`summaryforMultipleRows` argument on each converter, which only
+combines duplicate PSMs identifying the same feature within a single
+MS run.
 
 ```{r }
 maxquant_balanced = MSstatsBalancedDesign(maxquant_processed, feature_columns)

From 901c811a7b0aea211e95bb7fea7ff906327743e8 Mon Sep 17 00:00:00 2001
From: tonywu1999
Date: Wed, 27 May 2026 15:19:42 -0400
Subject: [PATCH 2/2] fix docs

---
 R/utils_documentation.R                    |  7 ++-----
 man/DIAUmpiretoMSstatsFormat.Rd            |  7 ++-----
 man/FragPipetoMSstatsFormat.Rd             |  7 ++-----
 man/MaxQtoMSstatsFormat.Rd                 |  7 ++-----
 man/MaxQtoMSstatsTMTFormat.Rd              |  7 ++-----
 man/MetamorpheusToMSstatsFormat.Rd         |  7 ++-----
 man/OpenMStoMSstatsFormat.Rd               |  7 ++-----
 man/OpenSWATHtoMSstatsFormat.Rd            |  7 ++-----
 man/PDtoMSstatsFormat.Rd                   |  7 ++-----
 man/PDtoMSstatsTMTFormat.Rd                |  7 ++-----
 man/PhilosophertoMSstatsTMTFormat.Rd       |  7 ++-----
 man/ProgenesistoMSstatsFormat.Rd           |  7 ++-----
 man/ProteinProspectortoMSstatsTMTFormat.Rd |  7 ++-----
 man/SpectroMinetoMSstatsTMTFormat.Rd       |  7 ++-----
 man/SpectronauttoMSstatsFormat.Rd          |  7 ++-----
 man/dot-sharedParametersAmongConverters.Rd |  7 ++-----
 vignettes/msstats_data_format.Rmd          | 14 +++-----------
 17 files changed, 35 insertions(+), 91 deletions(-)

diff --git a/R/utils_documentation.R b/R/utils_documentation.R
index 9e0a8b7ae..59b64b3d7 100644
--- a/R/utils_documentation.R
+++ b/R/utils_documentation.R
@@ -8,11 +8,8 @@
 #' @param summaryforMultipleRows max or sum - when multiple PSMs identify
 #' the same feature within a single MS run (duplicate PSMs), use the
 #' highest (max) or sum of the duplicate intensities. Default is max for
-#' label-free converters and sum for TMT converters. This parameter does
-#' not control collapsing across fractions of the same biological mixture;
-#' fraction handling is performed separately by `MSstatsBalancedDesign()`
-#' (see the "Fractions and balanced design" section of the data format
-#' vignette).
+#' label-free converters and sum for TMT converters. Note that this parameter
+#' does NOT control collapsing across fractions of the same biological mixture.
 #' @param removeProtein_with1Feature TRUE will remove the proteins which have only 1 feature, which is the combination of peptide, precursor charge, fragment and charge. FALSE is default.
 #' @param removeProtein_with1Peptide TRUE will remove the proteins which have only 1 peptide and charge. FALSE is default.
 #' @param removeOxidationMpeptides TRUE will remove the peptides including 'oxidation (M)' in modification. FALSE is default.
diff --git a/man/DIAUmpiretoMSstatsFormat.Rd b/man/DIAUmpiretoMSstatsFormat.Rd
index aa04273d0..19f8d2339 100644
--- a/man/DIAUmpiretoMSstatsFormat.Rd
+++ b/man/DIAUmpiretoMSstatsFormat.Rd
@@ -41,11 +41,8 @@ DIAUmpiretoMSstatsFormat(
 \item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
 the same feature within a single MS run (duplicate PSMs), use the
 highest (max) or sum of the duplicate intensities. Default is max for
-label-free converters and sum for TMT converters. This parameter does
-not control collapsing across fractions of the same biological mixture;
-fraction handling is performed separately by \code{MSstatsBalancedDesign()}
-(see the "Fractions and balanced design" section of the data format
-vignette).}
+label-free converters and sum for TMT converters. Note that this parameter
+does NOT control collapsing across fractions of the same biological mixture.}
 
 \item{use_log_file}{logical. If TRUE, information about data processing
 will be saved to a file.}
diff --git a/man/FragPipetoMSstatsFormat.Rd b/man/FragPipetoMSstatsFormat.Rd
index 323007c90..3553e61c7 100644
--- a/man/FragPipetoMSstatsFormat.Rd
+++ b/man/FragPipetoMSstatsFormat.Rd
@@ -30,11 +30,8 @@ We assume to use unique peptide for each protein.}
 \item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
 the same feature within a single MS run (duplicate PSMs), use the
 highest (max) or sum of the duplicate intensities. Default is max for
-label-free converters and sum for TMT converters.
This parameter does
-not control collapsing across fractions of the same biological mixture;
-fraction handling is performed separately by \code{MSstatsBalancedDesign()}
-(see the "Fractions and balanced design" section of the data format
-vignette).}
+label-free converters and sum for TMT converters. Note that this parameter
+does NOT control collapsing across fractions of the same biological mixture.}
 
 \item{use_log_file}{logical. If TRUE, information about data processing
 will be saved to a file.}
diff --git a/man/MaxQtoMSstatsFormat.Rd b/man/MaxQtoMSstatsFormat.Rd
index edf657b5b..3e154cbac 100644
--- a/man/MaxQtoMSstatsFormat.Rd
+++ b/man/MaxQtoMSstatsFormat.Rd
@@ -37,11 +37,8 @@ We assume to use unique peptide for each protein.}
 \item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
 the same feature within a single MS run (duplicate PSMs), use the
 highest (max) or sum of the duplicate intensities. Default is max for
-label-free converters and sum for TMT converters. This parameter does
-not control collapsing across fractions of the same biological mixture;
-fraction handling is performed separately by \code{MSstatsBalancedDesign()}
-(see the "Fractions and balanced design" section of the data format
-vignette).}
+label-free converters and sum for TMT converters. Note that this parameter
+does NOT control collapsing across fractions of the same biological mixture.}
 
 \item{removeFewMeasurements}{TRUE (default) will remove the features that have 1 or 2 measurements across runs.}
 
diff --git a/man/MaxQtoMSstatsTMTFormat.Rd b/man/MaxQtoMSstatsTMTFormat.Rd
index 4f8d0801b..aa468e64e 100644
--- a/man/MaxQtoMSstatsTMTFormat.Rd
+++ b/man/MaxQtoMSstatsTMTFormat.Rd
@@ -42,11 +42,8 @@ We assume to use unique peptide for each protein.}
 \item{summaryforMultipleRows}{max or sum - when multiple PSMs identify
 the same feature within a single MS run (duplicate PSMs), use the
 highest (max) or sum of the duplicate intensities.
Default is max for -label-free converters and sum for TMT converters. This parameter does -not control collapsing across fractions of the same biological mixture; -fraction handling is performed separately by \code{MSstatsBalancedDesign()} -(see the "Fractions and balanced design" section of the data format -vignette).} +label-free converters and sum for TMT converters. Note that this parameter +does NOT control collapsing across fractions of the same biological mixture.} \item{use_log_file}{logical. If TRUE, information about data processing will be saved to a file.} diff --git a/man/MetamorpheusToMSstatsFormat.Rd b/man/MetamorpheusToMSstatsFormat.Rd index cf0e643c7..55333e30f 100644 --- a/man/MetamorpheusToMSstatsFormat.Rd +++ b/man/MetamorpheusToMSstatsFormat.Rd @@ -39,11 +39,8 @@ We assume to use unique peptide for each protein.} \item{summaryforMultipleRows}{max or sum - when multiple PSMs identify the same feature within a single MS run (duplicate PSMs), use the highest (max) or sum of the duplicate intensities. Default is max for -label-free converters and sum for TMT converters. This parameter does -not control collapsing across fractions of the same biological mixture; -fraction handling is performed separately by \code{MSstatsBalancedDesign()} -(see the "Fractions and balanced design" section of the data format -vignette).} +label-free converters and sum for TMT converters. Note that this parameter +does NOT control collapsing across fractions of the same biological mixture.} \item{use_log_file}{logical. 
If TRUE, information about data processing will be saved to a file.} diff --git a/man/OpenMStoMSstatsFormat.Rd b/man/OpenMStoMSstatsFormat.Rd index ce894243e..f72239444 100644 --- a/man/OpenMStoMSstatsFormat.Rd +++ b/man/OpenMStoMSstatsFormat.Rd @@ -34,11 +34,8 @@ We assume to use unique peptide for each protein.} \item{summaryforMultipleRows}{max or sum - when multiple PSMs identify the same feature within a single MS run (duplicate PSMs), use the highest (max) or sum of the duplicate intensities. Default is max for -label-free converters and sum for TMT converters. This parameter does -not control collapsing across fractions of the same biological mixture; -fraction handling is performed separately by \code{MSstatsBalancedDesign()} -(see the "Fractions and balanced design" section of the data format -vignette).} +label-free converters and sum for TMT converters. Note that this parameter +does NOT control collapsing across fractions of the same biological mixture.} \item{use_log_file}{logical. If TRUE, information about data processing will be saved to a file.} diff --git a/man/OpenSWATHtoMSstatsFormat.Rd b/man/OpenSWATHtoMSstatsFormat.Rd index 05c9a4f4b..a37e507a9 100644 --- a/man/OpenSWATHtoMSstatsFormat.Rd +++ b/man/OpenSWATHtoMSstatsFormat.Rd @@ -40,11 +40,8 @@ We assume to use unique peptide for each protein.} \item{summaryforMultipleRows}{max or sum - when multiple PSMs identify the same feature within a single MS run (duplicate PSMs), use the highest (max) or sum of the duplicate intensities. Default is max for -label-free converters and sum for TMT converters. This parameter does -not control collapsing across fractions of the same biological mixture; -fraction handling is performed separately by \code{MSstatsBalancedDesign()} -(see the "Fractions and balanced design" section of the data format -vignette).} +label-free converters and sum for TMT converters. 
Note that this parameter +does NOT control collapsing across fractions of the same biological mixture.} \item{use_log_file}{logical. If TRUE, information about data processing will be saved to a file.} diff --git a/man/PDtoMSstatsFormat.Rd b/man/PDtoMSstatsFormat.Rd index ea1b6dc71..d220d9ecf 100644 --- a/man/PDtoMSstatsFormat.Rd +++ b/man/PDtoMSstatsFormat.Rd @@ -37,11 +37,8 @@ We assume to use unique peptide for each protein.} \item{summaryforMultipleRows}{max or sum - when multiple PSMs identify the same feature within a single MS run (duplicate PSMs), use the highest (max) or sum of the duplicate intensities. Default is max for -label-free converters and sum for TMT converters. This parameter does -not control collapsing across fractions of the same biological mixture; -fraction handling is performed separately by \code{MSstatsBalancedDesign()} -(see the "Fractions and balanced design" section of the data format -vignette).} +label-free converters and sum for TMT converters. Note that this parameter +does NOT control collapsing across fractions of the same biological mixture.} \item{removeFewMeasurements}{TRUE (default) will remove the features that have 1 or 2 measurements across runs.} diff --git a/man/PDtoMSstatsTMTFormat.Rd b/man/PDtoMSstatsTMTFormat.Rd index 688c8e0fd..e26438085 100644 --- a/man/PDtoMSstatsTMTFormat.Rd +++ b/man/PDtoMSstatsTMTFormat.Rd @@ -40,11 +40,8 @@ We assume to use unique peptide for each protein.} \item{summaryforMultipleRows}{max or sum - when multiple PSMs identify the same feature within a single MS run (duplicate PSMs), use the highest (max) or sum of the duplicate intensities. Default is max for -label-free converters and sum for TMT converters. 
This parameter does -not control collapsing across fractions of the same biological mixture; -fraction handling is performed separately by \code{MSstatsBalancedDesign()} -(see the "Fractions and balanced design" section of the data format -vignette).} +label-free converters and sum for TMT converters. Note that this parameter +does NOT control collapsing across fractions of the same biological mixture.} \item{use_log_file}{logical. If TRUE, information about data processing will be saved to a file.} diff --git a/man/PhilosophertoMSstatsTMTFormat.Rd b/man/PhilosophertoMSstatsTMTFormat.Rd index 26900ea73..fa2ae66a7 100644 --- a/man/PhilosophertoMSstatsTMTFormat.Rd +++ b/man/PhilosophertoMSstatsTMTFormat.Rd @@ -53,11 +53,8 @@ We assume to use unique peptide for each protein.} \item{summaryforMultipleRows}{max or sum - when multiple PSMs identify the same feature within a single MS run (duplicate PSMs), use the highest (max) or sum of the duplicate intensities. Default is max for -label-free converters and sum for TMT converters. This parameter does -not control collapsing across fractions of the same biological mixture; -fraction handling is performed separately by \code{MSstatsBalancedDesign()} -(see the "Fractions and balanced design" section of the data format -vignette).} +label-free converters and sum for TMT converters. Note that this parameter +does NOT control collapsing across fractions of the same biological mixture.} \item{use_log_file}{logical. If TRUE, information about data processing will be saved to a file.} diff --git a/man/ProgenesistoMSstatsFormat.Rd b/man/ProgenesistoMSstatsFormat.Rd index 220f38b19..aa3a8307f 100644 --- a/man/ProgenesistoMSstatsFormat.Rd +++ b/man/ProgenesistoMSstatsFormat.Rd @@ -30,11 +30,8 @@ We assume to use unique peptide for each protein.} \item{summaryforMultipleRows}{max or sum - when multiple PSMs identify the same feature within a single MS run (duplicate PSMs), use the highest (max) or sum of the duplicate intensities. 
Default is max for -label-free converters and sum for TMT converters. This parameter does -not control collapsing across fractions of the same biological mixture; -fraction handling is performed separately by \code{MSstatsBalancedDesign()} -(see the "Fractions and balanced design" section of the data format -vignette).} +label-free converters and sum for TMT converters. Note that this parameter +does NOT control collapsing across fractions of the same biological mixture.} \item{removeFewMeasurements}{TRUE (default) will remove the features that have 1 or 2 measurements across runs.} diff --git a/man/ProteinProspectortoMSstatsTMTFormat.Rd b/man/ProteinProspectortoMSstatsTMTFormat.Rd index ad0cbd9c9..e2032fcac 100644 --- a/man/ProteinProspectortoMSstatsTMTFormat.Rd +++ b/man/ProteinProspectortoMSstatsTMTFormat.Rd @@ -35,11 +35,8 @@ We assume to use unique peptide for each protein.} \item{summaryforMultipleRows}{max or sum - when multiple PSMs identify the same feature within a single MS run (duplicate PSMs), use the highest (max) or sum of the duplicate intensities. Default is max for -label-free converters and sum for TMT converters. This parameter does -not control collapsing across fractions of the same biological mixture; -fraction handling is performed separately by \code{MSstatsBalancedDesign()} -(see the "Fractions and balanced design" section of the data format -vignette).} +label-free converters and sum for TMT converters. Note that this parameter +does NOT control collapsing across fractions of the same biological mixture.} \item{use_log_file}{logical. 
If TRUE, information about data processing will be saved to a file.} diff --git a/man/SpectroMinetoMSstatsTMTFormat.Rd b/man/SpectroMinetoMSstatsTMTFormat.Rd index e1f1757ea..efbf4bec3 100644 --- a/man/SpectroMinetoMSstatsTMTFormat.Rd +++ b/man/SpectroMinetoMSstatsTMTFormat.Rd @@ -39,11 +39,8 @@ We assume to use unique peptide for each protein.} \item{summaryforMultipleRows}{max or sum - when multiple PSMs identify the same feature within a single MS run (duplicate PSMs), use the highest (max) or sum of the duplicate intensities. Default is max for -label-free converters and sum for TMT converters. This parameter does -not control collapsing across fractions of the same biological mixture; -fraction handling is performed separately by \code{MSstatsBalancedDesign()} -(see the "Fractions and balanced design" section of the data format -vignette).} +label-free converters and sum for TMT converters. Note that this parameter +does NOT control collapsing across fractions of the same biological mixture.} \item{use_log_file}{logical. If TRUE, information about data processing will be saved to a file.} diff --git a/man/SpectronauttoMSstatsFormat.Rd b/man/SpectronauttoMSstatsFormat.Rd index a32e21e50..91caca283 100644 --- a/man/SpectronauttoMSstatsFormat.Rd +++ b/man/SpectronauttoMSstatsFormat.Rd @@ -78,11 +78,8 @@ We assume to use unique peptide for each protein.} \item{summaryforMultipleRows}{max or sum - when multiple PSMs identify the same feature within a single MS run (duplicate PSMs), use the highest (max) or sum of the duplicate intensities. Default is max for -label-free converters and sum for TMT converters. This parameter does -not control collapsing across fractions of the same biological mixture; -fraction handling is performed separately by \code{MSstatsBalancedDesign()} -(see the "Fractions and balanced design" section of the data format -vignette).} +label-free converters and sum for TMT converters. 
Note that this parameter +does NOT control collapsing across fractions of the same biological mixture.} \item{calculateAnomalyScores}{Default is FALSE. If TRUE, will run anomaly detection model and calculate anomaly scores for each feature. Used downstream to weigh measurements in differential analysis.} diff --git a/man/dot-sharedParametersAmongConverters.Rd b/man/dot-sharedParametersAmongConverters.Rd index 100a14053..9bf0ab9cc 100644 --- a/man/dot-sharedParametersAmongConverters.Rd +++ b/man/dot-sharedParametersAmongConverters.Rd @@ -15,11 +15,8 @@ We assume to use unique peptide for each protein.} \item{summaryforMultipleRows}{max or sum - when multiple PSMs identify the same feature within a single MS run (duplicate PSMs), use the highest (max) or sum of the duplicate intensities. Default is max for -label-free converters and sum for TMT converters. This parameter does -not control collapsing across fractions of the same biological mixture; -fraction handling is performed separately by \code{MSstatsBalancedDesign()} -(see the "Fractions and balanced design" section of the data format -vignette).} +label-free converters and sum for TMT converters. Note that this parameter +does NOT control collapsing across fractions of the same biological mixture.} \item{removeProtein_with1Feature}{TRUE will remove the proteins which have only 1 feature, which is the combination of peptide, precursor charge, fragment and charge. FALSE is default.} diff --git a/vignettes/msstats_data_format.Rmd b/vignettes/msstats_data_format.Rmd index b2ba9aef6..17bec0d53 100644 --- a/vignettes/msstats_data_format.Rmd +++ b/vignettes/msstats_data_format.Rmd @@ -307,7 +307,7 @@ should consists of elements named Finally, after preprocessing, the `MSstatsBalancedDesign` function can be applied to handle fractions and create a balanced design. 
-For label-free and SRM data, fractionation or technical replicates are +For label-free data, fractionation or technical replicates are detected if this information is not provided. Features that overlap across multiple fractions of the same sample are assigned to a single fraction by the following rule: for each feature, the fraction with the @@ -319,19 +319,11 @@ then adjusted so that within each fraction, every feature has a row for each run. If the intensity value is missing, it is denoted by `NA`. For TMT data, a unique fraction is selected for each overlapped feature -using a different cascade of criteria, applied in order until the -overlap is resolved: the fraction with the **highest mean intensity** -across channels is tried first, then the **highest sum**, then the -**highest single intensity** (`max`). If features still overlap after -all three passes, their intensities are averaged across fractions as a -final fallback. After fraction selection, the data are adjusted so +as well. After fraction selection, the data are adjusted so that within each run, every feature has a row for each channel. If the intensity is missing for a channel, it is denoted by `NA`. -These two rules differ by design: the label-free path optimizes for -the fraction with the most observations of the feature, while the TMT -path optimizes for the fraction with the strongest signal. Note also -that this fraction-collapsing logic is distinct from the +Note also that this fraction-collapsing logic is distinct from the `summaryforMultipleRows` argument on each converter, which only combines duplicate PSMs identifying the same feature within a single MS run.