Known differences from R limma

pylimma is validated against R limma using pre-computed CSV fixtures generated by tests/fixtures/generate_all_fixtures.R. The target tolerance is rtol=1e-6, atol=1e-12 for deterministic statistics, and a log10-scale comparison for p-values.
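The tolerance policy above can be sketched as follows. This is a minimal illustration assuming NumPy; the helper names close_deterministic and max_log10_gap are hypothetical and are not pylimma's actual test utilities, and the numbers are made up.

```python
import numpy as np

def close_deterministic(py_vals, r_vals, rtol=1e-6, atol=1e-12):
    """Element-wise check |py - r| <= atol + rtol * |r| (hypothetical helper)."""
    return bool(np.allclose(py_vals, r_vals, rtol=rtol, atol=atol))

def max_log10_gap(py_p, r_p, floor=1e-300):
    """Largest |log10(p_py) - log10(p_r)|; the floor guards against log10(0)."""
    py_p = np.maximum(np.asarray(py_p, dtype=float), floor)
    r_p = np.maximum(np.asarray(r_p, dtype=float), floor)
    return float(np.max(np.abs(np.log10(py_p) - np.log10(r_p))))

# Made-up values, for illustration only.
r_t = np.array([2.5, -1.25, 0.0])
py_t = r_t * (1 + 5e-7)          # relative error below rtol=1e-6
r_p = np.array([1e-8, 0.04])
py_p = np.array([2e-8, 0.05])    # differs by at most a factor of 2

print(close_deterministic(py_t, r_t))
print(max_log10_gap(py_p, r_p))
```

Comparing p-values on the log10 scale keeps very small p-values from passing trivially under an absolute tolerance.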

The sections below document every numerical gap that remains after porting.

Accepted differences

Four differences remain after porting. All four are statistical artefacts of numerical-algorithm choices, not porting bugs. All are quantified, reproducible, and inside published tolerances.

normexp saddle-point fit drifts up to ~2e-4 from R

normexp_fit(method="saddle") parameters match R within rtol ~2e-4; the objective function value agrees to rtol ~6e-9. method="mle" matches R to rel ~2e-13; method="rma", method="rma75", and normexp_signal all match R at floating-point precision.

Root cause. scipy’s Nelder-Mead and R’s nmmin share the same algorithm and initial simplex but use different termination rules. R uses a relative f-range VH - VL < intol * (|VL| + intol) with intol = sqrt(eps) = 1.49e-8, giving convtol ~4e-4 at the saddle objective scale (f ~ 25000). scipy uses absolute fatol=1e-4 and xatol=1e-4. scipy consequently runs a few iterations further and lands at a slightly better f at a different point on the flat saddle-likelihood plateau.
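The arithmetic behind the two stopping thresholds can be checked directly. The sketch below only evaluates the formulas quoted above at the approximate objective scale; it does not run either optimiser, and the variable names are illustrative.

```python
import numpy as np

intol = np.sqrt(np.finfo(float).eps)  # sqrt(eps), about 1.49e-8
f_low = 25000.0                       # approximate saddle objective value (VL)

# R-style relative rule: stop when VH - VL < intol * (|VL| + intol)
r_threshold = intol * (abs(f_low) + intol)

# scipy's Nelder-Mead default is an absolute tolerance on the f-range
# (it also requires the simplex to shrink below xatol).
scipy_fatol = 1e-4

ratio = r_threshold / scipy_fatol
print(f"R threshold ~ {r_threshold:.2e}")
print(f"scipy fatol = {scipy_fatol:.0e}")
print(f"R accepts an f-range about {ratio:.1f}x wider than scipy")
```

Because R's threshold scales with the objective value, it is looser than scipy's absolute fatol at this scale, which is consistent with scipy running a few extra iterations.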

Tolerance. Parity tests for method="saddle" use rtol=1e-3. All other normexp methods are at rtol=1e-10 or better.

normalize_vsn output drifts up to ~2.4e-4 from R/vsn 3.66.0

normalize_vsn matches R’s normalizeVSN (which delegates to the Bioconductor vsn package) within rtol ~2.4e-4. The transform formula and hoffset rescaling are faithful ports: at R’s converged parameters pylimma reproduces R’s output to rtol ~1e-7 (verified by TestNormalizeVSNRParity::test_transform_at_r_params_matches_r_to_machine_precision). The remaining drift comes entirely from the L-BFGS-B optimisation step.

Root cause. The vsn profile log-likelihood is asymptotically flat under a uniform shift in the per-column scale parameter b. In the large-y limit, with z ~ exp(b) * y, arsinh(z) ~ log(2 z) = log(2) + b + log(y), so a uniform shift adds the same constant to every transformed cell; the row-mean centring absorbs it, and the per-stratum hoffset = log2(2 * exp(mean(b))) rescaling absorbs it again at the end. R/vsn calls the lbfgsb routine exported by R’s C API (the same code behind optim’s L-BFGS-B method) and pylimma calls scipy.optimize.minimize(method="L-BFGS-B"); under the same loose convergence tolerances the two implementations land at different points along the flat valley despite reporting near-identical negative log-likelihoods (typically agreeing to four decimal places). Because the absorption is exact only asymptotically, the two end points give slightly different transformed output at finite y, which is the drift quoted above.
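The flat direction can be illustrated numerically. The sketch below is a toy model, not vsn's likelihood: it assumes the transform arsinh(exp(b) * y) with the additive offset omitted, uses synthetic intensities, and measures how much a uniform shift in b changes the row-centred output as the intensity scale grows. The function names are hypothetical.

```python
import numpy as np

def row_centred_transform(y, b):
    """arsinh(exp(b) * y) with row means removed (offset omitted for simplicity)."""
    h = np.arcsinh(np.exp(b) * y)
    return h - h.mean(axis=1, keepdims=True)

def max_effect_of_shift(scale, shift=0.5, seed=0):
    """Largest change in the row-centred output when every b moves by `shift`."""
    rng = np.random.default_rng(seed)
    # Toy intensities: 200 rows x 4 columns spread around `scale`.
    y = scale * rng.uniform(0.5, 2.0, size=(200, 4))
    b = np.zeros(4)
    shifted = row_centred_transform(y, b + shift)
    base = row_centred_transform(y, b)
    return float(np.max(np.abs(shifted - base)))

for scale in (1.0, 1e2, 1e4):
    print(scale, max_effect_of_shift(scale))
```

The effect of the shift shrinks rapidly as intensities grow, so at high intensities the data barely constrain where along this direction the optimiser stops.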

Tolerance. Parity tests for normalize_vsn use rtol=5e-4 for the end-to-end output and rtol=1e-6 for the transform-at-R-params verification.

Note. Because the likelihood is genuinely flat in this direction, the divergence is irreducible without a regularisation term that breaks the flat direction (which would itself be a deviation from R). pylimma starts L-BFGS-B from b=0 (unit scale factor) rather than R/vsn’s b=1. With b=0 scipy’s optimiser stays in the same valley region as R’s, giving the 2.4e-4 figure above; with b=1 (R’s pstartHeuristic) scipy walks to the opposite end of the valley, giving rtol ~4e-3. This divergence from R’s heuristic is deliberate: pstart is documented as a heuristic in vsn/R/vsn2.R and is not exposed by limma’s normalizeVSN.default, so the change is invisible to limma users.

Monte-Carlo rotation tests (roast, mroast, romer, gene_set_test)

Rotation-based gene-set tests draw rotations from NumPy’s PCG64 RNG. R uses the Mersenne Twister inside its own sample.int / rnorm C routines. The two streams cannot be aligned byte-for-byte from the same seed. Deterministic summaries match R to rtol=1e-15:

ngenes_in_set
observed test statistics
active proportions (“PropDown” / “PropUp”)
the rotated-effects matrix when the rotation seed is matched post-rotation (e.g. via compare_pvalues(max_log10_diff=0.5))

Monte-Carlo p-values from roast / mroast / romer / gene_set_test agree with R within sampling error.
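The reason the p-values can only agree within sampling error is that the same seed produces unrelated streams under different generators. A small sketch using NumPy alone (the seed value is arbitrary); note that NumPy's own MT19937 is shown only for contrast and is not expected to reproduce R's stream either, since R applies its own seeding and normal-variate conversion.

```python
import numpy as np

seed = 12345  # arbitrary illustrative seed

# PCG64 is NumPy's default bit generator.
pcg = np.random.Generator(np.random.PCG64(seed))
# NumPy's Mersenne Twister implementation, seeded with the same value.
mt = np.random.Generator(np.random.MT19937(seed))

z_pcg = pcg.standard_normal(5)
z_mt = mt.standard_normal(5)

# The two draws share a seed but not a stream.
same_stream = bool(np.array_equal(z_pcg, z_mt))
print(same_stream)
```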

Tolerance. Empirically ~0.3 log10 (factor of 2) between R and pylimma at nrot=999 on the gene-set-testing fixture data, well inside the documented max_log10_diff=0.5 threshold.
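A rough sense of the expected gap can be had from a binomial approximation. The sketch below is an illustrative delta-method estimate, not part of pylimma's test suite: it treats each Monte-Carlo p-value as a binomial proportion over nrot rotations and computes the standard deviation of the log10 difference between two independent runs. The function name is hypothetical.

```python
import numpy as np

def log10_gap_sd(p, nrot):
    """Approximate SD of log10(p_hat_1) - log10(p_hat_2) for two independent runs.

    Delta method: SD(log10 p_hat) ~ sqrt((1 - p) / (nrot * p)) / ln(10).
    """
    sd_single = np.sqrt((1.0 - p) / (nrot * p)) / np.log(10.0)
    return float(np.sqrt(2.0) * sd_single)

for p in (0.05, 0.01, 0.005):
    print(p, round(log10_gap_sd(p, nrot=999), 3))

# A log10 difference of 0.3 corresponds to roughly a factor of 2.
print(10 ** 0.3)
```

Under this approximation, gaps of a few tenths of a log10 unit are unsurprising for small p-values at nrot=999, and they shrink as the true p-value or nrot grows.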

mrlm stdev_unscaled drifts up to ~15% on machine-epsilon residuals

When a gene is perfectly fit by the design (initial OLS residuals at machine epsilon), mrlm and R MASS::rlm produce different stdev_unscaled values, with up to 15% relative difference. The trigger is exclusive to synthetic perfectly-fit rows; real proteomics or RNA-seq residuals are at least six orders of magnitude above machine epsilon and are unaffected.

Root cause. R’s lm.wfit uses LINPACK DQRDC2, which produces mixed-sign machine-epsilon residuals on a degenerate row. NumPy’s np.linalg.lstsq uses a LAPACK SVD solver, which produces uniform-sign residuals. The iter-1 MAD scale picks up R’s mixed-sign noise pattern and Huber-downweights three samples as “outliers”; pylimma’s MAD gives unit weights everywhere and returns the unweighted OLS stdev. Both implementations are computing deterministic numerical noise, just with different noise patterns, because the residuals carry no information about the underlying linear model.
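The mechanism can be reproduced with made-up residuals. The sketch below assumes the usual MASS::rlm defaults (scale = median(|r|) / 0.6745 and Huber k = 1.345); the residual vectors are invented for illustration and are not actual DQRDC2 or LAPACK output, and huber_weights is a hypothetical helper.

```python
import numpy as np

def huber_weights(resid, k=1.345):
    """Huber weights min(1, k * scale / |r|) with scale = median(|r|) / 0.6745."""
    resid = np.asarray(resid, dtype=float)
    scale = np.median(np.abs(resid)) / 0.6745
    abs_r = np.maximum(np.abs(resid), np.finfo(float).tiny)  # avoid division by zero
    return np.minimum(1.0, k * scale / abs_r)

eps = np.finfo(float).eps

# Invented machine-epsilon noise patterns (illustrative only).
mixed = eps * np.array([0.5, -0.5, 2.0, -0.5, 0.5, -3.0, 0.5, 2.5])
uniform = eps * np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])

w_mixed = huber_weights(mixed)
w_uniform = huber_weights(uniform)

# Samples whose |r| exceeds k * scale get weight < 1.
n_downweighted = int(np.sum(w_mixed < 1.0))
print(n_downweighted, bool(np.all(w_uniform == 1.0)))
```

Because the scale estimate is itself at machine epsilon, small differences in the noise pattern decide which samples cross the Huber threshold, even though none of them is a real outlier.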

Tolerance. A regression sentinel test (tests/rigorous/test_mrlm.py::test_b9c_zero_residual_scale) is left as xfail rather than loosened, so any future numerical change that aligns the two patterns is detected automatically.

Note. Downstream impact is zero: when a row hits this regime sigma is also at machine epsilon, t-statistics are inf/NaN, and the empirical-Bayes posterior is dominated by the prior regardless of which “garbage” stdev was returned.