diff --git a/.gitignore b/.gitignore index a20707a..ab1d214 100644 --- a/.gitignore +++ b/.gitignore @@ -5,3 +5,4 @@ /docs/build/ Manifest.toml build/ +.DS_Store diff --git a/Project.toml b/Project.toml index 86ea0a5..427e8aa 100644 --- a/Project.toml +++ b/Project.toml @@ -1,7 +1,15 @@ name = "RidgeRegression" uuid = "739161c8-60e1-4c49-8f89-ff30998444b1" -authors = ["Vivak Patel "] version = "0.1.0" +authors = ["Eton Tackett ", "Vivak Patel "] + +[deps] +CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b" +DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" +Downloads = "f43a241f-c20a-4ad4-852c-f6b1247861c6" [compat] +CSV = "0.10.15" +DataFrames = "1.8.1" +Downloads = "1.7.0" julia = "1.12.4" diff --git a/docs/make.jl b/docs/make.jl index a1097bb..d42cfbe 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -14,6 +14,7 @@ makedocs(; ), pages=[ "Home" => "index.md", + "Design" => "design.md", ], ) diff --git a/docs/src/design.md b/docs/src/design.md new file mode 100644 index 0000000..d15a807 --- /dev/null +++ b/docs/src/design.md @@ -0,0 +1,162 @@ +# Motivation and Background +Many modern science problems involve regression problems with extremely large numbers of +predictors. +[Genome-wide association studies (GWAS)](https://doi.org/10.1371/journal.pone.0245376), +for example, try to identify genetic +variants associated with a disease phenotype using hundreds of thousands or millions of +genomic features. In such settings, traditional least squares methods fail. +Penalized Least Squares (PLS) extends ordinary least squares (OLS) +regression by adding a penalty term to shrink parameter estimates. Ridge regression, an +approach within PLS, adds a regularization term, producing a regularized estimator. + +Mathematically, ridge regression estimates the regression coefficients by solving the +penalized least squares problem +```math +\hat{\boldsymbol{\beta}} = +\arg\min_{\boldsymbol{\beta}} +\left( +\| \mathbf{y} - X\boldsymbol{\beta} \|_2^2 ++ +\lambda \| \boldsymbol{\beta} \|_2^2 +\right) +``` +where $\lambda > 0$ is a regularization parameter that controls the strength of the penalty. + +The purpose of ridge regression is to stabilize regression estimates when the predictors are +highly correlated or the design matrix $X$ is nearly singular. Ridge regression modifies the +least squares objective by adding a penalty on the squared $\ell_2$-norm of the coefficient +vector. The estimator is obtained by minimizing a penalized least squares objective in which +large coefficient values are discouraged through the penalty term $\lambda +\|\boldsymbol{\beta}\|_2^2$. This penalty shrinks the estimated coefficients toward the +origin, which reduces the variance of the estimator and mitigates the effects of +multicollinearity. + +There are many numerical algorithms available to compute ridge regression estimates +including direct methods, Krylov subspace methods, gradient-based optimization, coordinate +descent, and stochastic gradient descent. These algorithms differ in their computational +costs and numerical stability. + +The goal of this experiment is to investigate the performance of these algorithms when we +vary the structure and scale of the regression problem. To do this, we consider the linear +model $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ where the matrix ${X}$ +may be constructed with varying dimensions, sparsity patterns, and conditioning properties. +# Questions +The primary goal of this experiment is to compare numerical algorithms for computing ridge +regression estimates under various conditions. In particular, we aim to address the +following questions: + +1. How does the performance of ridge regression algorithms change as the structural + and numerical properties of the regression problem vary? + +2. Which ridge regression algorithm provides the best balance between numerical + stability and computational cost across these problem regimes? + +# Experimental Units +An experimental unit is defined by a triplet: a design matrix, $X$; a response vector, $y$; +and a non-negative regularization parameter $\lambda$. +Experimental units are selected based on factors we believe will potentially impact +the performance of a specific algorithm. +These factors, or blocks, correspond to: +- The dimensional regime of $X \in \mathbb{R}^{n \times p}$: + $p \ll n$, $p ≈ n$, and $p \gg n$. +- The sparsity of the design matrix: measured as a fraction in $[0,1]$. +- The magnitude of the regularization parameter, $\lambda$, as described below. + +The regularization parameter is often selected using one of the following strategies, +depending on application domain: + +- [Generalized Cross Validation](https://doi.org/10.1080/00401706.1979.10489751) +- [The L-Curve Method](https://doi.org/10.1137/1034115) +- [Information Criteria](https://doi.org/10.1002/wics.1607) +- [Morozov's Discrepancy Principle](https://doi.org/10.1088/0266-5611/26/2/025001) + +Excluding Morozov's discrepancy principle, the remaining methods require +computing $\hat{\beta}_\lambda = (X^\top X + \lambda I)^{-1}X^\top y$ or +$\Vert X\hat{\beta}_\lambda - y \Vert_2$ over a grid of values for $\lambda$. +Given our assumption that problem conditioning will impact algorithm behavior, +we will choose this grid to modify the problem's condition number systematically. + +Recall, the ridge estimator's closed form is given by solving +```math +\min_{\beta} \left\Vert \begin{bmatrix} X \\ \lambda I \end{bmatrix}\beta - +y \right\Vert_2^2, +``` +or (equivalently, from a mathematical perspective) solving +$(X^\top X + \lambda I)\hat{\beta}_\lambda = X^\top y$. +Let $\sigma_{\max}$ denote the largest singular value +of $X$, and let $\sigma_{\min}$ denote the smallest singular value of +$X$ (note, $0$ is allowed). +Then the condition number for the normal systems approach is +```math +\kappa(X^\top X+\lambda I) += +\frac{\sigma_{\max}^2+\lambda}{\sigma_{\min}^2+\lambda}. +``` + +There are two cases we consider. +- Case 1: $\sigma_{\min}=0$. In this case, the least squares problem is ill-posed. We choose + the smallest value of $\lambda$ in the grid to be + $\sigma_{\max}^2 / (10^{12} - 1)$, which ensures + $\kappa(X^\top X + \lambda I) = 10^{12}$. +- Case 2: $\sigma_{\min}>0$. In this case, we choose the smallest value of $\lambda$ in the + grid to be $\sigma_{\max}\sigma_{\min}$, which ensures + $\kappa(X^\top X + \lambda I) = \sigma_{\max}/\sigma_{\min}$. + That is, it sets the regularized normal system's condition number to that of the original + least squares problem. + +In both cases, assuming $\sigma_{\max}^2 > 2 \sigma_{\min}$, we choose the larges value of +$\lambda$ to be $\sigma_{\max}^2 - 2 \sigma_{\min}$. The condition number for the +regularized normal equations is $\kappa(X^\top X + \lambda I)=2$. + + +The total number of block combinations is determined by the product of the number of levels +in each blocking factor, denoted b. For example, if the experiment includes three +dimensional regimes, two sparsity levels, and two regularization strengths, then there are +$3 * 2 * 2 = 12$ block combinations. We will also denote r to be the number of replicated +datasets in each block. Here, we mean the number datasets within a block. The total number +of experimental units is then ${b * r}$. + + +| Blocking System | Factor | Blocks | +|:----------------|:-------|:-------| +| Dataset | Dimensional regime | $(p \ll n)$, $(p \approx n)$, $(p \gg n)$| +| Ridge Penalty | Magnitude of ${\lambda}$ relative to the spectral scale of $X^\top X$ | Strong: $\kappa \approx 10$, Moderate: $\kappa \approx 10^3$, Weak: $\kappa \approx 10^6$, with $\lambda$ computed from the corresponding $\epsilon$. | +| Matrix Sparsity| Density of non-zero values in $X$ | Sparse (< 10% non-zero), Moderate (10%-50% non-zero), Dense (> 50% non-zero)| + +# Treatments + +The treatments are the ridge regression solution methods: + +- Gradient-based optimization +- Stochastic gradient descent +- Direct Methods + - Golub Kahan Bidiagonalization + + Since each experimental unit will recieves all t treatments, the total number of algorithm + runs in the experiment is ${t * b * r}$. For this experiment, ${t=3}$. To ensure fair + comparison between algorithms, each treatment will be applied under a fixed wall time + constraint. Each algorithm will be run for a maximum of two hours per experimental unit. +# Observational Units and Measurements + +The observational units are each algorithm-dataset pair. For each combination we will observe the following + +| Column Name | Data Type | Description | +|:---|:---|:---| +| `dataset_id` | Positive Integer | Identifier for the generated dataset (experimental unit). | +| `dimensional_regime` | String | Relationship between predictors and observations: `p << n`, `p ≈ n`, or `p >> n`. | +| `sparsity_level` | String | Density of the matrix `X`: `Sparse`, `Moderate`, or `Dense`. | +| `lambda_level` | String | Relative magnitude of the ridge penalty parameter `λ`: `Weak`, `Moderate`, or `Strong`. | +| `algorithm` | String | Ridge regression solution method used: `GradientDescent`, `SGD`, or `DirectMethod`. | +| `runtime_seconds` | Positive Floating-point | Time required for the algorithm to compute a solution. | +| `iterations` | Positive Integer | Number of iterations performed by the algorithm (`NA` for direct methods). | +| `step_size` | Positive Floating-point | Step size used in gradient descent or SGD (`NA` for direct methods). | +| `batch_size` | Positive Integer | Number of samples used per SGD update (`NA` for direct methods and gradient descent). | +| `number_of_epochs` | Positive Integer | Number epochs per observation (`NA` for direct methods). | + + + +The collected measurements will be written to a CSV file. Each row in the file corresponds +to a single algorithm–dataset pair, which forms the observational unit of the experiment. +The columns represent the recorded measurements. After the experiment, the resulting CSV +file should contain ${Algorithms∗Datasets}$ number of rows and each row will contain exactly +10 columns. \ No newline at end of file diff --git a/test/Project.toml b/test/Project.toml index 0c36332..73141b0 100644 --- a/test/Project.toml +++ b/test/Project.toml @@ -1,2 +1,5 @@ [deps] +CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b" +DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40" +LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e" \ No newline at end of file