Known differences from R limma ============================== pylimma is validated against R limma using pre-computed CSV fixtures generated by ``tests/fixtures/generate_all_fixtures.R``. The target tolerance is ``rtol=1e-6, atol=1e-12`` for deterministic statistics, and a log10-scale comparison for p-values. The sections below document every numerical gap that remains after porting. Accepted differences -------------------- Four differences remain after porting. All four are statistical artefacts of numerical-algorithm choices, not porting bugs. All are quantified, reproducible, and inside published tolerances. normexp saddle-point fit drifts up to ~2e-4 from R ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``normexp_fit(method="saddle")`` parameters match R within ``rtol ~2e-4``; the objective function value agrees to ``rtol ~6e-9``. ``method="mle"`` matches R to ``rel ~2e-13``; ``method="rma"``, ``method="rma75"``, and ``normexp_signal`` all match R at floating-point precision. **Root cause.** scipy's Nelder-Mead and R's ``nmmin`` share the same algorithm and initial simplex but use different termination rules. R uses a relative f-range ``VH - VL < intol * (|VL| + intol)`` with ``intol = sqrt(eps) = 1.49e-8``, giving ``convtol ~4e-4`` at the saddle objective scale (``f ~ 25000``). scipy uses absolute ``fatol=1e-4`` and ``xatol=1e-4``. scipy consequently runs a few iterations further and lands at a slightly better f at a different point on the flat saddle-likelihood plateau. **Tolerance.** Parity tests for ``method="saddle"`` use ``rtol=1e-3``. All other normexp methods are at ``rtol=1e-10`` or better. normalize_vsn output drifts up to ~2.4e-4 from R/vsn 3.66.0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``normalize_vsn`` matches R's ``normalizeVSN`` (which delegates to the Bioconductor ``vsn`` package) within ``rtol ~2.4e-4``. The transform formula and ``hoffset`` rescaling are bit-faithful: at R's converged parameters pylimma reproduces R's output to ``rtol ~1e-7`` (verified by ``TestNormalizeVSNRParity::test_transform_at_r_params_matches_r_to_machine_precision``). The remaining drift comes entirely from the L-BFGS-B optimisation step. **Root cause.** The vsn profile log-likelihood is asymptotically flat under a uniform shift in the per-column scale parameter ``b``. In the large-``y`` limit ``arsinh(z) ~ log(2 z) = log(2) + b + log(y)``, so a uniform shift adds the same constant to every transformed cell, the row-mean centring absorbs it, and the per-stratum ``hoffset = log2(2 * exp(mean(b)))`` rescaling absorbs it again at the end. The likelihood becomes asymptotically flat under that direction. R/vsn calls LINPACK's ``lbfgsb`` and pylimma calls ``scipy.optimize.minimize(method="L-BFGS-B")``; under the same loose convergence tolerances the two implementations land at different points along the flat valley despite reporting near-identical negative log-likelihood (typically agreeing to four decimal places). Because the absorption is exact only asymptotically, the residual disagreement in the *transformed output* at finite ``y`` is the figure quoted above. **Tolerance.** Parity tests for ``normalize_vsn`` use ``rtol=5e-4`` for the end-to-end output and ``rtol=1e-6`` for the transform-at-R-params verification. **Note.** Because the likelihood is genuinely flat in this direction the divergence is irreducible without a regularisation term breaking the flat direction (which would itself be a deviation from R). pylimma chooses ``b=0`` (unit scale factor) as the L-BFGS-B starting point rather than R/vsn's ``b=1``: with ``b=0`` scipy's optimiser stays in the same valley region as R's, giving the ``2.4e-4`` figure above; with ``b=1`` (R's ``pstartHeuristic``) scipy walks to the opposite end of the valley, giving ``rtol ~4e-3``. This is a deliberate divergence from R's heuristic; ``pstart`` is documented as a heuristic in ``vsn/R/vsn2.R`` and is not exposed by limma's ``normalizeVSN.default``, so the change is invisible to limma users. Monte-Carlo rotation tests (``roast``, ``mroast``, ``romer``, ``gene_set_test``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Rotation-based gene-set tests draw rotations from NumPy's PCG64 RNG. R uses the Mersenne Twister inside its own ``sample.int`` / ``rnorm`` C routines. The two streams cannot be aligned byte-for-byte from the same seed. Deterministic summaries match R to ``rtol=1e-15``: - ``ngenes_in_set`` - observed test statistics - active proportions ("PropDown" / "PropUp") - the rotated-effects matrix when the rotation seed is matched post-rotation (e.g. via ``compare_pvalues(max_log10_diff=0.5)``) Monte-Carlo p-values from ``roast`` / ``mroast`` / ``romer`` / ``gene_set_test`` agree with R within sampling error. **Tolerance.** Empirically ~0.3 log10 (factor of 2) between R and pylimma at ``nrot=999`` on the gene-set-testing fixture data, well inside the documented ``max_log10_diff=0.5`` threshold. mrlm stdev_unscaled drifts up to ~15% on machine-epsilon residuals ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When a gene is perfectly fit by the design (initial OLS residuals at machine epsilon), ``mrlm`` and R ``MASS::rlm`` produce different ``stdev_unscaled`` values - up to 15% relative difference. The trigger is exclusive to synthetic perfectly-fit rows; real proteomics or RNA-seq residuals are 6+ orders of magnitude above machine epsilon and unaffected. **Root cause.** R's ``lm.wfit`` uses LINPACK ``DQRDC2`` which produces mixed-sign machine-epsilon residuals on a degenerate row. scipy's ``np.linalg.lstsq`` uses LAPACK SVD which produces uniform-sign residuals. The iter-1 MAD scale picks up R's mixed-sign noise pattern and Huber-downweights three samples as "outliers"; pylimma's MAD gives unit weights everywhere and returns the unweighted OLS stdev. Both implementations are computing deterministic numerical noise - just different noise patterns - because the residuals carry no information about the underlying linear model. **Tolerance.** A regression sentinel test (``tests/rigorous/test_mrlm.py::test_b9c_zero_residual_scale``) is left as ``xfail`` rather than loosened, so any future numerical change that aligns the two patterns is detected automatically. **Note.** Downstream impact is zero: when a row hits this regime ``sigma`` is also at machine epsilon, t-statistics are inf/NaN, and the empirical-Bayes posterior is dominated by the prior regardless of which "garbage" stdev was returned.