Architecture

This document provides a technical deep dive into the deg.js library architecture, GPU implementations, and algorithms.

Overview

deg.js implements the Wilcoxon rank-sum test (Mann-Whitney U test) using GPU acceleration. The library supports two GPU backends (WebGPU and WebGL2) and multiple processing modes (standard, streaming, chunked) to handle datasets of various sizes.

┌─────────────────────────────────────────────────────────────┐
│                    rankGenesGroups()                        │
│                    (Main Entry Point)                       │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│              Backend & Mode Resolution                       │
│  • Auto-select WebGPU or WebGL2                             │
│  • Auto-select standard, streaming, or chunked mode         │
│  • Auto-select bitonic or radix sort (WebGPU)               │
└─────────────────────────────────────────────────────────────┘
                            │
          ┌─────────────────┼─────────────────┐
          ▼                 ▼                 ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│     WebGPU      │ │    WebGL2       │ │    WebGL2       │
│     Direct      │ │    Unified      │ │   Streaming     │
└─────────────────┘ └─────────────────┘ └─────────────────┘
          │                 │                 │
          └─────────────────┴─────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                   Post-Processing                            │
│  • P-value calculation (Z → p-value)                        │
│  • P-value correction (BH or Bonferroni)                    │
│  • Log fold change computation                              │
│  • Top-N gene selection                                     │
└─────────────────────────────────────────────────────────────┘

Wilcoxon Rank-Sum Test Algorithm

Statistical Foundation

The Wilcoxon rank-sum test compares two distributions by ranking all values together:

Merge & Rank: Combine group cells with reference cells (or "rest"), rank all values
Sum Ranks: Calculate the sum of ranks for the group
Z-score: Compare observed rank sum to expected (normal approximation)
P-value: Convert Z-score to two-sided p-value

Mathematical Details

For each gene, comparing group G against reference R:

Given:
  n_g = number of cells in group G
  n_r = number of cells in reference R
  N = n_g + n_r (total cells)

Algorithm:
  1. Merge values: [v_g1, v_g2, ..., v_r1, v_r2, ...]
  2. Sort and assign ranks (average for ties)
  3. W = Σ(ranks of group G cells)

Normal Approximation:
  μ_W = n_g × (N + 1) / 2
  σ_W = √[T × n_g × n_r × (N + 1) / 12]

  where T = 1 - Σ(t³ - t) / (N³ - N) for tie correction
        t = number of values in each tied group

  Z = (W - μ_W) / σ_W
  p-value = 2 × Φ(-|Z|)  (two-sided)

GPU Backends

WebGPU

WebGPU provides compute shaders with direct GPU buffer access, offering the highest performance.

Advantages:

Native compute shader support (WGSL)
Direct buffer-to-buffer operations
Higher memory limits (storage buffers)
More efficient for large datasets

Components:

WebGPUCompute: Bitonic sort-based implementation
WebGPUComputeRadix: Radix sort-based implementation (O(n) complexity)

WebGL2

WebGL2 uses fragment shaders and texture operations to emulate compute functionality.

Advantages:

Broader browser support
Works on older hardware
Fallback for non-WebGPU browsers

Limitations:

Texture size limits (typically 16384 × 16384)
Requires texture-based workarounds for compute
Less efficient for very large datasets

Components:

UnifiedGPUCompute: Standard 1D texture layout
TiledUnifiedGPUCompute: 2D tiling for large datasets
BatchGPUCompute: For specific reference group comparisons

Sorting Algorithms

Bitonic Sort

Used by both WebGPU and WebGL2 backends.

Complexity: O(n log²n)

Algorithm:

for each stage k = 0 to log₂(n):
    for each step j = k down to 0:
        for each element i in parallel:
            partner = i XOR (1 << j)
            direction = (i >> (k+1)) & 1
            compare_exchange(a[i], a[partner], direction)

Characteristics:

Highly parallelizable (perfect for GPU)
Fixed comparison network (data-independent)
Good for small to medium datasets

Radix Sort (WebGPU Only)

Complexity: O(n)

Algorithm:

Extract bits at current position
Count occurrences per workgroup (histogram)
Prefix sum for scatter addresses
Scatter elements to sorted positions
Repeat for next bit position

Characteristics:

Linear complexity regardless of input distribution
Better for datasets > 10K cells
Higher per-operation overhead (more passes)

Auto-Selection

typescript

if (nCells >= 10_000 && backend === 'webgpu') {
  sortAlgorithm = 'radix';
} else {
  sortAlgorithm = 'bitonic';
}

Processing Modes

Standard Mode

All genes processed in a single GPU pass.

┌─────────────────────────────────────────────┐
│           Expression Matrix                  │
│         [nCells × nGenes]                   │
│                                             │
│  ┌─────┬─────┬─────┬─────┬─────┐           │
│  │ G0  │ G1  │ G2  │ ... │ Gn  │ ← Genes   │
│  ├─────┼─────┼─────┼─────┼─────┤           │
│  │ c0  │     │     │     │     │           │
│  │ c1  │     │     │     │     │           │
│  │ ... │     │     │     │     │           │
│  │ cm  │     │     │     │     │           │
│  └─────┴─────┴─────┴─────┴─────┘           │
│    ↓                                        │
│  GPU: Sort, Rank, Sum (all genes at once)   │
│    ↓                                        │
│  Results for all genes                      │
└─────────────────────────────────────────────┘

When used:

WebGPU: < 1000 genes
WebGL2: < 100K cells

Streaming Mode

Genes processed in chunks to manage memory.

┌──────────────────────────────────────────────┐
│  Chunk 1: Genes 0-499                        │
│  ┌─────────────────────────────────────────┐ │
│  │ GPU: Sort → Rank → Sum                  │ │
│  └─────────────────────────────────────────┘ │
│              ↓ Accumulate results            │
├──────────────────────────────────────────────┤
│  Chunk 2: Genes 500-999                      │
│  ┌─────────────────────────────────────────┐ │
│  │ GPU: Sort → Rank → Sum                  │ │
│  └─────────────────────────────────────────┘ │
│              ↓ Accumulate results            │
├──────────────────────────────────────────────┤
│  ...                                         │
├──────────────────────────────────────────────┤
│  Final: Combine all chunk results            │
└──────────────────────────────────────────────┘

When used:

WebGPU: ≥ 1000 genes
WebGL2: ≥ 100K cells

Chunked Input Mode

Data loaded on-demand via user callback.

┌──────────────────────────────────────────────┐
│  User provides: getGeneChunk(start, count)   │
└──────────────────────────────────────────────┘
              │
              ▼
┌──────────────────────────────────────────────┐
│  Library calls getGeneChunk(0, 100)          │
│  → User loads genes 0-99 from disk/network   │
│  → GPU processes chunk                       │
│  → Memory freed                              │
├──────────────────────────────────────────────┤
│  Library calls getGeneChunk(100, 100)        │
│  → User loads genes 100-199                  │
│  → GPU processes chunk                       │
│  → Memory freed                              │
├──────────────────────────────────────────────┤
│  ...                                         │
└──────────────────────────────────────────────┘

When used:

Dataset too large for memory
Data stored externally (HDF5, IndexedDB, network)

Memory Management

Chunk Size Calculation

typescript

// Bytes per gene for GPU processing
bytesPerGene = nCells × 16 + paddedCells × 16;

// Maximum genes per chunk based on memory limit
maxGenesPerChunk = (maxMemoryMB × 1024 × 1024) / bytesPerGene;

// Cap at 500 genes for GPU efficiency
chunkSize = Math.min(maxGenesPerChunk, 500);

WebGL2 Tiling

When data exceeds texture limits, 2D tiling is used:

Standard Layout (1D):
  [cell_0, cell_1, ..., cell_n] × nGenes rows

  Width: paddedCells (power of 2)
  Height: nGenes
  Limit: paddedCells ≤ maxTextureSize

Tiled Layout (2D):
  Gene 0: [row_0: cells 0-1023] [row_1: cells 1024-2047] ...
  Gene 1: [row_0: cells 0-1023] [row_1: cells 1024-2047] ...
  ...

  Width: cellWidth (typically 1024-4096)
  Height: cellRowsPerGene × nGenes
  Supports: millions of cells

Shader Pipeline

WebGPU Pipeline

1. Transpose & Init
   Row-major input → Column-major with indices

2. Sort (Bitonic or Radix)
   Parallel sorting of (value, originalIndex) pairs

3. Rank Compute
   Assign ranks with tie averaging
   Output: (rank, originalIndex, tieCount)

4. Rank Sum
   Masked reduction: sum ranks by group membership
   Output: rank sum per gene per group

5. LogFC Compute (optional)
   Group statistics and fold change on GPU

WebGL2 Pipeline

1. Upload Data
   Expression values → RGBA32F texture

2. Bitonic Sort (multiple passes)
   Ping-pong framebuffers for in-place sorting

3. Rank Compute
   Fragment shader: scan neighbors for ties

4. Masked Reduction
   Tree-based parallel sum with group mask

5. Read Results
   GPU → CPU transfer for final computation

Error Function Implementation

P-value computation requires the error function (erf). deg.js uses a polynomial approximation:

typescript

// Horner's method for erf approximation
// Abramowitz and Stegun formula 7.1.26
function erf(x: number): number {
  const a1 =  0.254829592;
  const a2 = -0.284496736;
  const a3 =  1.421413741;
  const a4 = -1.453152027;
  const a5 =  1.061405429;
  const p  =  0.3275911;

  const sign = x < 0 ? -1 : 1;
  x = Math.abs(x);

  const t = 1.0 / (1.0 + p * x);
  const y = 1.0 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.exp(-x * x);

  return sign * y;
}

P-value Correction

Benjamini-Hochberg (FDR)

typescript

function benjaminiHochberg(pvals: Float64Array): Float64Array {
  const n = pvals.length;
  const sorted = [...pvals].map((p, i) => ({ p, i }))
    .sort((a, b) => a.p - b.p);

  const adjusted = new Float64Array(n);
  let cumMin = 1.0;

  // Process in reverse order
  for (let rank = n; rank >= 1; rank--) {
    const { p, i } = sorted[rank - 1];
    const adj = Math.min(p * n / rank, 1.0);
    cumMin = Math.min(cumMin, adj);
    adjusted[i] = cumMin;
  }

  return adjusted;
}

Bonferroni

typescript

function bonferroni(pvals: Float64Array): Float64Array {
  const n = pvals.length;
  return pvals.map(p => Math.min(p * n, 1.0));
}

Performance Characteristics

Scenario	Backend	Mode	Complexity
< 1K genes, < 10K cells	WebGPU	Direct	O(n log²n) per gene
< 1K genes, ≥ 10K cells	WebGPU	Direct (Radix)	O(n) per gene
≥ 1K genes	WebGPU	Streaming	O(n) × chunks
< 100K cells	WebGL2	Unified	O(n log²n) per gene
≥ 100K cells	WebGL2	Streaming + Tiled	O(n log²n) × chunks × tiles

Directory Structure

src/
├── index.ts                 # Main entry point
├── core/                    # Algorithm implementations
│   ├── wilcoxon.ts         # Basic per-gene Wilcoxon
│   ├── wilcoxon-unified.ts # All-groups unified (WebGL2)
│   ├── wilcoxon-batch.ts   # Specific reference group
│   ├── wilcoxon-streaming.ts       # WebGL2 streaming
│   └── wilcoxon-streaming-webgpu.ts # WebGPU streaming
├── webgl/                   # WebGL2 implementation
│   ├── context.ts          # WebGL2 context management
│   ├── gpu-compute.ts      # Single-gene operations
│   ├── unified-gpu-compute.ts     # All-genes 1D
│   ├── tiled-gpu-compute.ts       # All-genes 2D tiled
│   └── shaders/            # GLSL fragment shaders
├── webgpu/                  # WebGPU implementation
│   ├── context.ts          # WebGPU context management
│   ├── gpu-compute.ts      # Bitonic sort compute
│   ├── gpu-compute-radix.ts # Radix sort compute
│   └── shaders/            # WGSL compute shaders
├── math/                    # Mathematical functions
│   ├── normal-sf.ts        # Error function, CDF
│   └── fold-change.ts      # Log fold change
├── utils/                   # Utilities
│   ├── indexing.ts         # Row/column-major helpers
│   ├── streaming.ts        # Streaming configuration
│   └── tiling.ts           # 2D layout calculation
└── types/                   # TypeScript types
    ├── input.ts            # Input types
    └── output.ts           # Result types

Architecture ​

Overview ​

Wilcoxon Rank-Sum Test Algorithm ​

Statistical Foundation ​

Mathematical Details ​

GPU Backends ​

WebGPU ​

WebGL2 ​

Sorting Algorithms ​

Bitonic Sort ​

Radix Sort (WebGPU Only) ​

Auto-Selection ​

Processing Modes ​

Standard Mode ​

Streaming Mode ​

Chunked Input Mode ​

Memory Management ​

Chunk Size Calculation ​

WebGL2 Tiling ​

Shader Pipeline ​

WebGPU Pipeline ​

WebGL2 Pipeline ​

Error Function Implementation ​

P-value Correction ​

Benjamini-Hochberg (FDR) ​

Bonferroni ​

Performance Characteristics ​

Directory Structure ​

See Also ​

Architecture

Overview

Wilcoxon Rank-Sum Test Algorithm

Statistical Foundation

Mathematical Details

GPU Backends

WebGPU

WebGL2

Sorting Algorithms

Bitonic Sort

Radix Sort (WebGPU Only)

Auto-Selection

Processing Modes

Standard Mode

Streaming Mode

Chunked Input Mode

Memory Management

Chunk Size Calculation

WebGL2 Tiling

Shader Pipeline

WebGPU Pipeline

WebGL2 Pipeline

Error Function Implementation

P-value Correction

Benjamini-Hochberg (FDR)

Bonferroni

Performance Characteristics

Directory Structure

See Also