Known differences from R limma
pylimma is validated against R limma using pre-computed CSV fixtures
generated by tests/fixtures/generate_all_fixtures.R. The target
tolerance is rtol=1e-6, atol=1e-12 for deterministic statistics,
and a log10-scale comparison for p-values.
The sections below document every numerical gap that remains after porting.
Accepted differences
Four differences remain after porting. All four are statistical artefacts of numerical-algorithm choices, not porting bugs. All are quantified, reproducible, and inside published tolerances.
normexp saddle-point fit drifts up to ~2e-4 from R
normexp_fit(method="saddle") parameters match R within
rtol ~2e-4; the objective function value agrees to
rtol ~6e-9. method="mle" matches R to rel ~2e-13;
method="rma", method="rma75", and normexp_signal all match
R at floating-point precision.
Root cause. scipy’s Nelder-Mead and R’s nmmin share the same
algorithm and initial simplex but use different termination rules. R
uses a relative f-range VH - VL < intol * (|VL| + intol) with
intol = sqrt(eps) = 1.49e-8, giving convtol ~4e-4 at the
saddle objective scale (f ~ 25000). scipy uses absolute
fatol=1e-4 and xatol=1e-4. scipy consequently runs a few
iterations further and lands at a slightly better f at a different
point on the flat saddle-likelihood plateau.
Tolerance. Parity tests for method="saddle" use
rtol=1e-3. All other normexp methods are at rtol=1e-10 or
better.
normalize_vsn output drifts up to ~2.4e-4 from R/vsn 3.66.0
normalize_vsn matches R’s normalizeVSN (which delegates to
the Bioconductor vsn package) within rtol ~2.4e-4. The
transform formula and hoffset rescaling are bit-faithful: at
R’s converged parameters pylimma reproduces R’s output to
rtol ~1e-7 (verified by
TestNormalizeVSNRParity::test_transform_at_r_params_matches_r_to_machine_precision).
The remaining drift comes entirely from the L-BFGS-B optimisation
step.
Root cause. The vsn profile log-likelihood is asymptotically
flat under a uniform shift in the per-column scale parameter
b. In the large-y limit arsinh(z) ~ log(2 z) = log(2) +
b + log(y), so a uniform shift adds the same constant to every
transformed cell, the row-mean centring absorbs it, and the
per-stratum hoffset = log2(2 * exp(mean(b))) rescaling absorbs
it again at the end. The likelihood becomes asymptotically flat
under that direction. R/vsn calls LINPACK’s lbfgsb and
pylimma calls scipy.optimize.minimize(method="L-BFGS-B");
under the same loose convergence tolerances the two
implementations land at different points along the flat valley
despite reporting near-identical negative log-likelihood
(typically agreeing to four decimal places). Because the
absorption is exact only asymptotically, the residual disagreement
in the transformed output at finite y is the figure quoted
above.
Tolerance. Parity tests for normalize_vsn use
rtol=5e-4 for the end-to-end output and rtol=1e-6 for the
transform-at-R-params verification.
Note. Because the likelihood is genuinely flat in this
direction the divergence is irreducible without a regularisation
term breaking the flat direction (which would itself be a
deviation from R). pylimma chooses b=0 (unit scale factor) as
the L-BFGS-B starting point rather than R/vsn’s b=1: with
b=0 scipy’s optimiser stays in the same valley region as R’s,
giving the 2.4e-4 figure above; with b=1 (R’s
pstartHeuristic) scipy walks to the opposite end of the
valley, giving rtol ~4e-3. This is a deliberate divergence
from R’s heuristic; pstart is documented as a heuristic in
vsn/R/vsn2.R and is not exposed by limma’s
normalizeVSN.default, so the change is invisible to limma
users.
Monte-Carlo rotation tests (roast, mroast, romer, gene_set_test)
Rotation-based gene-set tests draw rotations from NumPy’s PCG64 RNG.
R uses the Mersenne Twister inside its own sample.int / rnorm
C routines. The two streams cannot be aligned byte-for-byte from the
same seed. Deterministic summaries match R to rtol=1e-15:
ngenes_in_setobserved test statistics
active proportions (“PropDown” / “PropUp”)
the rotated-effects matrix when the rotation seed is matched post-rotation (e.g. via
compare_pvalues(max_log10_diff=0.5))
Monte-Carlo p-values from roast / mroast / romer /
gene_set_test agree with R within sampling error.
Tolerance. Empirically ~0.3 log10 (factor of 2) between R and
pylimma at nrot=999 on the gene-set-testing fixture data, well
inside the documented max_log10_diff=0.5 threshold.
mrlm stdev_unscaled drifts up to ~15% on machine-epsilon residuals
When a gene is perfectly fit by the design (initial OLS residuals at
machine epsilon), mrlm and R MASS::rlm produce different
stdev_unscaled values - up to 15% relative difference. The trigger
is exclusive to synthetic perfectly-fit rows; real proteomics or
RNA-seq residuals are 6+ orders of magnitude above machine epsilon
and unaffected.
Root cause. R’s lm.wfit uses LINPACK DQRDC2 which produces
mixed-sign machine-epsilon residuals on a degenerate row. scipy’s
np.linalg.lstsq uses LAPACK SVD which produces uniform-sign
residuals. The iter-1 MAD scale picks up R’s mixed-sign noise pattern
and Huber-downweights three samples as “outliers”; pylimma’s MAD
gives unit weights everywhere and returns the unweighted OLS stdev.
Both implementations are computing deterministic numerical noise -
just different noise patterns - because the residuals carry no
information about the underlying linear model.
Tolerance. A regression sentinel test
(tests/rigorous/test_mrlm.py::test_b9c_zero_residual_scale) is
left as xfail rather than loosened, so any future numerical
change that aligns the two patterns is detected automatically.
Note. Downstream impact is zero: when a row hits this regime
sigma is also at machine epsilon, t-statistics are inf/NaN,
and the empirical-Bayes posterior is dominated by the prior
regardless of which “garbage” stdev was returned.