Known differences from R limma
==============================

pylimma is validated against R limma using pre-computed CSV fixtures
generated by ``tests/fixtures/generate_all_fixtures.R``. The target
tolerance is ``rtol=1e-6, atol=1e-12`` for deterministic statistics,
and a log10-scale comparison for p-values.

The sections below document every numerical gap that remains after
porting.

Accepted differences
--------------------

Four differences remain after porting. All four are statistical
artefacts of numerical-algorithm choices, not porting bugs. All are
quantified, reproducible, and inside published tolerances.

normexp saddle-point fit drifts up to ~2e-4 from R
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``normexp_fit(method="saddle")`` parameters match R within
``rtol ~2e-4``; the objective function value agrees to
``rtol ~6e-9``. ``method="mle"`` matches R to ``rel ~2e-13``;
``method="rma"``, ``method="rma75"``, and ``normexp_signal`` all match
R at floating-point precision.

**Root cause.** scipy's Nelder-Mead and R's ``nmmin`` share the same
algorithm and initial simplex but use different termination rules. R
uses a relative f-range ``VH - VL < intol * (|VL| + intol)`` with
``intol = sqrt(eps) = 1.49e-8``, giving ``convtol ~4e-4`` at the
saddle objective scale (``f ~ 25000``). scipy uses absolute
``fatol=1e-4`` and ``xatol=1e-4``. scipy consequently runs a few
iterations further and lands at a slightly better f at a different
point on the flat saddle-likelihood plateau.

**Tolerance.** Parity tests for ``method="saddle"`` use
``rtol=1e-3``. All other normexp methods are at ``rtol=1e-10`` or
better.

normalize_vsn output drifts up to ~2.4e-4 from R/vsn 3.66.0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``normalize_vsn`` matches R's ``normalizeVSN`` (which delegates to
the Bioconductor ``vsn`` package) within ``rtol ~2.4e-4``. The
transform formula and ``hoffset`` rescaling are bit-faithful: at
R's converged parameters pylimma reproduces R's output to
``rtol ~1e-7`` (verified by
``TestNormalizeVSNRParity::test_transform_at_r_params_matches_r_to_machine_precision``).
The remaining drift comes entirely from the L-BFGS-B optimisation
step.

**Root cause.** The vsn profile log-likelihood is asymptotically
flat under a uniform shift in the per-column scale parameter
``b``. In the large-``y`` limit ``arsinh(z) ~ log(2 z) = log(2) +
b + log(y)``, so a uniform shift adds the same constant to every
transformed cell, the row-mean centring absorbs it, and the
per-stratum ``hoffset = log2(2 * exp(mean(b)))`` rescaling absorbs
it again at the end. The likelihood becomes asymptotically flat
under that direction. R/vsn calls LINPACK's ``lbfgsb`` and
pylimma calls ``scipy.optimize.minimize(method="L-BFGS-B")``;
under the same loose convergence tolerances the two
implementations land at different points along the flat valley
despite reporting near-identical negative log-likelihood
(typically agreeing to four decimal places). Because the
absorption is exact only asymptotically, the residual disagreement
in the *transformed output* at finite ``y`` is the figure quoted
above.

**Tolerance.** Parity tests for ``normalize_vsn`` use
``rtol=5e-4`` for the end-to-end output and ``rtol=1e-6`` for the
transform-at-R-params verification.

**Note.** Because the likelihood is genuinely flat in this
direction the divergence is irreducible without a regularisation
term breaking the flat direction (which would itself be a
deviation from R). pylimma chooses ``b=0`` (unit scale factor) as
the L-BFGS-B starting point rather than R/vsn's ``b=1``: with
``b=0`` scipy's optimiser stays in the same valley region as R's,
giving the ``2.4e-4`` figure above; with ``b=1`` (R's
``pstartHeuristic``) scipy walks to the opposite end of the
valley, giving ``rtol ~4e-3``. This is a deliberate divergence
from R's heuristic; ``pstart`` is documented as a heuristic in
``vsn/R/vsn2.R`` and is not exposed by limma's
``normalizeVSN.default``, so the change is invisible to limma
users.

Monte-Carlo rotation tests (``roast``, ``mroast``, ``romer``, ``gene_set_test``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Rotation-based gene-set tests draw rotations from NumPy's PCG64 RNG.
R uses the Mersenne Twister inside its own ``sample.int`` / ``rnorm``
C routines. The two streams cannot be aligned byte-for-byte from the
same seed. Deterministic summaries match R to ``rtol=1e-15``:

- ``ngenes_in_set``
- observed test statistics
- active proportions ("PropDown" / "PropUp")
- the rotated-effects matrix when the rotation seed is matched
  post-rotation (e.g. via ``compare_pvalues(max_log10_diff=0.5)``)

Monte-Carlo p-values from ``roast`` / ``mroast`` / ``romer`` /
``gene_set_test`` agree with R within sampling error.

**Tolerance.** Empirically ~0.3 log10 (factor of 2) between R and
pylimma at ``nrot=999`` on the gene-set-testing fixture data, well
inside the documented ``max_log10_diff=0.5`` threshold.

mrlm stdev_unscaled drifts up to ~15% on machine-epsilon residuals
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a gene is perfectly fit by the design (initial OLS residuals at
machine epsilon), ``mrlm`` and R ``MASS::rlm`` produce different
``stdev_unscaled`` values - up to 15% relative difference. The trigger
is exclusive to synthetic perfectly-fit rows; real proteomics or
RNA-seq residuals are 6+ orders of magnitude above machine epsilon
and unaffected.

**Root cause.** R's ``lm.wfit`` uses LINPACK ``DQRDC2`` which produces
mixed-sign machine-epsilon residuals on a degenerate row. scipy's
``np.linalg.lstsq`` uses LAPACK SVD which produces uniform-sign
residuals. The iter-1 MAD scale picks up R's mixed-sign noise pattern
and Huber-downweights three samples as "outliers"; pylimma's MAD
gives unit weights everywhere and returns the unweighted OLS stdev.
Both implementations are computing deterministic numerical noise -
just different noise patterns - because the residuals carry no
information about the underlying linear model.

**Tolerance.** A regression sentinel test
(``tests/rigorous/test_mrlm.py::test_b9c_zero_residual_scale``) is
left as ``xfail`` rather than loosened, so any future numerical
change that aligns the two patterns is detected automatically.

**Note.** Downstream impact is zero: when a row hits this regime
``sigma`` is also at machine epsilon, t-statistics are inf/NaN,
and the empirical-Bayes posterior is dominated by the prior
regardless of which "garbage" stdev was returned.