Architecture
This document provides a technical deep dive into the deg.js library architecture, GPU implementations, and algorithms.
Overview
deg.js implements the Wilcoxon rank-sum test (Mann-Whitney U test) using GPU acceleration. The library supports two GPU backends (WebGPU and WebGL2) and multiple processing modes (standard, streaming, chunked) to handle datasets of various sizes.
┌─────────────────────────────────────────────────────────────┐
│ rankGenesGroups() │
│ (Main Entry Point) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Backend & Mode Resolution │
│ • Auto-select WebGPU or WebGL2 │
│ • Auto-select standard, streaming, or chunked mode │
│ • Auto-select bitonic or radix sort (WebGPU) │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ WebGPU │ │ WebGL2 │ │ WebGL2 │
│ Direct │ │ Unified │ │ Streaming │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
└─────────────────┴─────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Post-Processing │
│ • P-value calculation (Z → p-value) │
│ • P-value correction (BH or Bonferroni) │
│ • Log fold change computation │
│ • Top-N gene selection │
└─────────────────────────────────────────────────────────────┘Wilcoxon Rank-Sum Test Algorithm
Statistical Foundation
The Wilcoxon rank-sum test compares two distributions by ranking all values together:
- Merge & Rank: Combine group cells with reference cells (or "rest"), rank all values
- Sum Ranks: Calculate the sum of ranks for the group
- Z-score: Compare observed rank sum to expected (normal approximation)
- P-value: Convert Z-score to two-sided p-value
Mathematical Details
For each gene, comparing group G against reference R:
Given:
n_g = number of cells in group G
n_r = number of cells in reference R
N = n_g + n_r (total cells)
Algorithm:
1. Merge values: [v_g1, v_g2, ..., v_r1, v_r2, ...]
2. Sort and assign ranks (average for ties)
3. W = Σ(ranks of group G cells)
Normal Approximation:
μ_W = n_g × (N + 1) / 2
σ_W = √[T × n_g × n_r × (N + 1) / 12]
where T = 1 - Σ(t³ - t) / (N³ - N) for tie correction
t = number of values in each tied group
Z = (W - μ_W) / σ_W
p-value = 2 × Φ(-|Z|) (two-sided)GPU Backends
WebGPU
WebGPU provides compute shaders with direct GPU buffer access, offering the highest performance.
Advantages:
- Native compute shader support (WGSL)
- Direct buffer-to-buffer operations
- Higher memory limits (storage buffers)
- More efficient for large datasets
Components:
WebGPUCompute: Bitonic sort-based implementationWebGPUComputeRadix: Radix sort-based implementation (O(n) complexity)
WebGL2
WebGL2 uses fragment shaders and texture operations to emulate compute functionality.
Advantages:
- Broader browser support
- Works on older hardware
- Fallback for non-WebGPU browsers
Limitations:
- Texture size limits (typically 16384 × 16384)
- Requires texture-based workarounds for compute
- Less efficient for very large datasets
Components:
UnifiedGPUCompute: Standard 1D texture layoutTiledUnifiedGPUCompute: 2D tiling for large datasetsBatchGPUCompute: For specific reference group comparisons
Sorting Algorithms
Bitonic Sort
Used by both WebGPU and WebGL2 backends.
Complexity: O(n log²n)
Algorithm:
for each stage k = 0 to log₂(n):
for each step j = k down to 0:
for each element i in parallel:
partner = i XOR (1 << j)
direction = (i >> (k+1)) & 1
compare_exchange(a[i], a[partner], direction)Characteristics:
- Highly parallelizable (perfect for GPU)
- Fixed comparison network (data-independent)
- Good for small to medium datasets
Radix Sort (WebGPU Only)
Complexity: O(n)
Algorithm:
- Extract bits at current position
- Count occurrences per workgroup (histogram)
- Prefix sum for scatter addresses
- Scatter elements to sorted positions
- Repeat for next bit position
Characteristics:
- Linear complexity regardless of input distribution
- Better for datasets > 10K cells
- Higher per-operation overhead (more passes)
Auto-Selection
if (nCells >= 10_000 && backend === 'webgpu') {
sortAlgorithm = 'radix';
} else {
sortAlgorithm = 'bitonic';
}Processing Modes
Standard Mode
All genes processed in a single GPU pass.
┌─────────────────────────────────────────────┐
│ Expression Matrix │
│ [nCells × nGenes] │
│ │
│ ┌─────┬─────┬─────┬─────┬─────┐ │
│ │ G0 │ G1 │ G2 │ ... │ Gn │ ← Genes │
│ ├─────┼─────┼─────┼─────┼─────┤ │
│ │ c0 │ │ │ │ │ │
│ │ c1 │ │ │ │ │ │
│ │ ... │ │ │ │ │ │
│ │ cm │ │ │ │ │ │
│ └─────┴─────┴─────┴─────┴─────┘ │
│ ↓ │
│ GPU: Sort, Rank, Sum (all genes at once) │
│ ↓ │
│ Results for all genes │
└─────────────────────────────────────────────┘When used:
- WebGPU: < 1000 genes
- WebGL2: < 100K cells
Streaming Mode
Genes processed in chunks to manage memory.
┌──────────────────────────────────────────────┐
│ Chunk 1: Genes 0-499 │
│ ┌─────────────────────────────────────────┐ │
│ │ GPU: Sort → Rank → Sum │ │
│ └─────────────────────────────────────────┘ │
│ ↓ Accumulate results │
├──────────────────────────────────────────────┤
│ Chunk 2: Genes 500-999 │
│ ┌─────────────────────────────────────────┐ │
│ │ GPU: Sort → Rank → Sum │ │
│ └─────────────────────────────────────────┘ │
│ ↓ Accumulate results │
├──────────────────────────────────────────────┤
│ ... │
├──────────────────────────────────────────────┤
│ Final: Combine all chunk results │
└──────────────────────────────────────────────┘When used:
- WebGPU: ≥ 1000 genes
- WebGL2: ≥ 100K cells
Chunked Input Mode
Data loaded on-demand via user callback.
┌──────────────────────────────────────────────┐
│ User provides: getGeneChunk(start, count) │
└──────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────┐
│ Library calls getGeneChunk(0, 100) │
│ → User loads genes 0-99 from disk/network │
│ → GPU processes chunk │
│ → Memory freed │
├──────────────────────────────────────────────┤
│ Library calls getGeneChunk(100, 100) │
│ → User loads genes 100-199 │
│ → GPU processes chunk │
│ → Memory freed │
├──────────────────────────────────────────────┤
│ ... │
└──────────────────────────────────────────────┘When used:
- Dataset too large for memory
- Data stored externally (HDF5, IndexedDB, network)
Memory Management
Chunk Size Calculation
// Bytes per gene for GPU processing
bytesPerGene = nCells × 16 + paddedCells × 16;
// Maximum genes per chunk based on memory limit
maxGenesPerChunk = (maxMemoryMB × 1024 × 1024) / bytesPerGene;
// Cap at 500 genes for GPU efficiency
chunkSize = Math.min(maxGenesPerChunk, 500);WebGL2 Tiling
When data exceeds texture limits, 2D tiling is used:
Standard Layout (1D):
[cell_0, cell_1, ..., cell_n] × nGenes rows
Width: paddedCells (power of 2)
Height: nGenes
Limit: paddedCells ≤ maxTextureSize
Tiled Layout (2D):
Gene 0: [row_0: cells 0-1023] [row_1: cells 1024-2047] ...
Gene 1: [row_0: cells 0-1023] [row_1: cells 1024-2047] ...
...
Width: cellWidth (typically 1024-4096)
Height: cellRowsPerGene × nGenes
Supports: millions of cellsShader Pipeline
WebGPU Pipeline
1. Transpose & Init
Row-major input → Column-major with indices
2. Sort (Bitonic or Radix)
Parallel sorting of (value, originalIndex) pairs
3. Rank Compute
Assign ranks with tie averaging
Output: (rank, originalIndex, tieCount)
4. Rank Sum
Masked reduction: sum ranks by group membership
Output: rank sum per gene per group
5. LogFC Compute (optional)
Group statistics and fold change on GPUWebGL2 Pipeline
1. Upload Data
Expression values → RGBA32F texture
2. Bitonic Sort (multiple passes)
Ping-pong framebuffers for in-place sorting
3. Rank Compute
Fragment shader: scan neighbors for ties
4. Masked Reduction
Tree-based parallel sum with group mask
5. Read Results
GPU → CPU transfer for final computationError Function Implementation
P-value computation requires the error function (erf). deg.js uses a polynomial approximation:
// Horner's method for erf approximation
// Abramowitz and Stegun formula 7.1.26
function erf(x: number): number {
const a1 = 0.254829592;
const a2 = -0.284496736;
const a3 = 1.421413741;
const a4 = -1.453152027;
const a5 = 1.061405429;
const p = 0.3275911;
const sign = x < 0 ? -1 : 1;
x = Math.abs(x);
const t = 1.0 / (1.0 + p * x);
const y = 1.0 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.exp(-x * x);
return sign * y;
}P-value Correction
Benjamini-Hochberg (FDR)
function benjaminiHochberg(pvals: Float64Array): Float64Array {
const n = pvals.length;
const sorted = [...pvals].map((p, i) => ({ p, i }))
.sort((a, b) => a.p - b.p);
const adjusted = new Float64Array(n);
let cumMin = 1.0;
// Process in reverse order
for (let rank = n; rank >= 1; rank--) {
const { p, i } = sorted[rank - 1];
const adj = Math.min(p * n / rank, 1.0);
cumMin = Math.min(cumMin, adj);
adjusted[i] = cumMin;
}
return adjusted;
}Bonferroni
function bonferroni(pvals: Float64Array): Float64Array {
const n = pvals.length;
return pvals.map(p => Math.min(p * n, 1.0));
}Performance Characteristics
| Scenario | Backend | Mode | Complexity |
|---|---|---|---|
| < 1K genes, < 10K cells | WebGPU | Direct | O(n log²n) per gene |
| < 1K genes, ≥ 10K cells | WebGPU | Direct (Radix) | O(n) per gene |
| ≥ 1K genes | WebGPU | Streaming | O(n) × chunks |
| < 100K cells | WebGL2 | Unified | O(n log²n) per gene |
| ≥ 100K cells | WebGL2 | Streaming + Tiled | O(n log²n) × chunks × tiles |
Directory Structure
src/
├── index.ts # Main entry point
├── core/ # Algorithm implementations
│ ├── wilcoxon.ts # Basic per-gene Wilcoxon
│ ├── wilcoxon-unified.ts # All-groups unified (WebGL2)
│ ├── wilcoxon-batch.ts # Specific reference group
│ ├── wilcoxon-streaming.ts # WebGL2 streaming
│ └── wilcoxon-streaming-webgpu.ts # WebGPU streaming
├── webgl/ # WebGL2 implementation
│ ├── context.ts # WebGL2 context management
│ ├── gpu-compute.ts # Single-gene operations
│ ├── unified-gpu-compute.ts # All-genes 1D
│ ├── tiled-gpu-compute.ts # All-genes 2D tiled
│ └── shaders/ # GLSL fragment shaders
├── webgpu/ # WebGPU implementation
│ ├── context.ts # WebGPU context management
│ ├── gpu-compute.ts # Bitonic sort compute
│ ├── gpu-compute-radix.ts # Radix sort compute
│ └── shaders/ # WGSL compute shaders
├── math/ # Mathematical functions
│ ├── normal-sf.ts # Error function, CDF
│ └── fold-change.ts # Log fold change
├── utils/ # Utilities
│ ├── indexing.ts # Row/column-major helpers
│ ├── streaming.ts # Streaming configuration
│ └── tiling.ts # 2D layout calculation
└── types/ # TypeScript types
├── input.ts # Input types
└── output.ts # Result typesSee Also
- Getting Started - Basic usage
- API Reference - Function documentation
- Advanced Usage - Streaming and chunked loading