This PR description was mostly written by AI. I reviewed it before submission.
Reference Issues/PRs
This will make #498 very easy, roughly a four-line change. #498 is not a bug, but I'd say it's still a big point of friction for MLflow users and probably doesn't help them adopt skops.
What does this implement/fix? Explain your changes.
This PR adds support for persisting/loading the internal scikit-learn types needed by GradientBoosting and HistGradientBoosting models without surfacing them as untrusted types.
The implementation introduces a sklearn-specific internal-object path in `skops/io/_sklearn.py` for a small allowlist of trusted sklearn internals used by GB/HGB models.
`CyHalfMultinomialLoss` is serialized under the qualified module path `sklearn._loss._loss.CyHalfMultinomialLoss` rather than `_loss.CyHalfMultinomialLoss`, so that trust is anchored under `sklearn.`. Maybe this is overkill?
The PR also adds targeted regression tests in `skops/io/tests/test_persist.py` covering GradientBoosting / HistGradientBoosting variants and checking that `CyHalfMultinomialLoss` is serialized under the qualified sklearn module path.
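The allowlist-plus-qualified-path idea can be sketched roughly as follows. This is a simplified illustration, not skops' actual code: `TRUSTED_SKLEARN_INTERNALS` and `resolve_trusted` are made-up names, and the real entries live in `skops/io/_sklearn.py`.

```python
import importlib

# Illustrative allowlist (made-up name); the real entries live in
# skops/io/_sklearn.py.
TRUSTED_SKLEARN_INTERNALS = {
    "sklearn._loss.link.Interval",
    "sklearn._loss._loss.CyHalfMultinomialLoss",
}

def resolve_trusted(dotted_path: str):
    """Resolve a dotted path only if it is on the allowlist.

    Anchoring entries under the fully qualified ``sklearn.`` prefix means a
    malicious file cannot smuggle in a look-alike top-level module such as
    ``_loss``.
    """
    if dotted_path not in TRUSTED_SKLEARN_INTERNALS:
        raise TypeError(f"untrusted type: {dotted_path!r}")
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)
```

Anything off the list, e.g. `os.system`, is rejected before any import happens.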
AI usage disclosure:
The code changes were developed with AI assistance in an interactive back-and-forth session.
I asked the assistant to probe which trusted types were needed for GradientBoosting/HistGradientBoosting, explain the relevant serialization/trust code paths, review security tradeoffs, and iterate on the implementation based on my feedback.
I reviewed the proposed changes, asked follow-up questions about the design and security properties, and requested specific adjustments such as using a qualified sklearn path for CyHalfMultinomialLoss and tightening/cleaning the sklearn-internal trust path.
I asked Codex to audit the added scikit-learn classes; here is its report (TL;DR: they are all simple, numeric-oriented classes and are safe to trust).
Details
Per-class notes
Stateless numeric wrappers
These classes are thin mathematical wrappers with no meaningful mutable state and no custom deserialization hooks. In practice they only expose NumPy / SciPy computations, so trusting them does not add an obvious code-execution surface.
Safe: stateless numeric wrappers only; no `eval` / `exec`, no dynamic import, no custom `__setstate__`, no attacker-controlled callable invocation on load.
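A rough version of the check behind that claim (an illustration of the methodology, not the exact audit script): a class is suspicious if it, or any base other than `object`, defines its own unpickling hooks.

```python
def has_custom_deser_hooks(cls: type) -> bool:
    """Return True if cls (or a non-object base) defines pickle hooks
    that would run custom code during deserialization."""
    hooks = ("__setstate__", "__reduce__", "__reduce_ex__")
    # Skip `object` itself, which provides the default __reduce__ helpers.
    return any(
        name in vars(base) for base in cls.__mro__[:-1] for name in hooks
    )

class StatelessWrapper:  # mimics a thin numeric wrapper
    def loss(self, x):
        return x * x

class CustomRestore:  # defines its own deserialization hook
    def __setstate__(self, state):
        self.__dict__.update(state)

print(has_custom_deser_hooks(StatelessWrapper))  # False
print(has_custom_deser_hooks(CustomRestore))     # True
```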
Small passive data containers
These objects mainly carry parameters or bounds. The main caveat is that skops restores them through `__new__` plus attribute injection, so constructor-time validation is bypassed.
| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn._loss.link.Interval` | Dataclass storing lower/upper bounds and inclusiveness flags for valid value ranges. | Safe from code execution: only scalar fields and range checks; malformed values could violate invariants, but there is no code-exec hook. |
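The `__new__`-plus-attribute-injection caveat is easy to demonstrate with a toy stand-in (`Interval` below is illustrative, not the real `sklearn._loss.link.Interval`):

```python
class Interval:
    """Toy stand-in for a passive data container with constructor checks."""

    def __init__(self, low, high):
        if low > high:  # constructor-time validation
            raise ValueError("low must not exceed high")
        self.low, self.high = low, high

# skops-style restore: __new__ plus attribute injection.
obj = Interval.__new__(Interval)            # __init__ is never called
obj.__dict__.update({"low": 5, "high": 1})  # invariant violated, no error
print(obj.low, obj.high)  # 5 1
```

The restored object can violate the class invariant, but nothing in this path executes attacker-controlled code.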
Python loss wrappers around trusted numeric backends
These classes wrap Cython loss kernels plus link objects and a few scalar flags. They do not define custom deserialization logic; in the skops path they are rebuilt by restoring plain attributes (`closs`, `link`, flags, intervals, sometimes `quantile` / `n_classes`).

**What they do:** Regression / classification loss objects used by GB / HGB and related estimators.

**Safety review:** Safe: attribute-only restore pattern, no custom `__setstate__`, no dynamic execution logic; malformed trusted state could still cause wrong predictions or runtime errors because constructor validation is not re-run.
Cython backend object
| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn._loss._loss.CyHalfMultinomialLoss` | Cython implementation of multiclass loss / gradient / probability kernels. | Safe in the skops path: effectively stateless here (`__getstate__` is `None`), and skops bypasses Cython's pickle helper by reconstructing the instance via `__new__`, so there is no attacker-controlled function call during load. |
HistGradientBoosting internals with explicit state restoration
These are the only reviewed classes with meaningful state-restoration behavior. I checked those methods directly.
**What they do:** Stores HGB binning metadata and maps raw features to integer bins.

**Safety review:** Safe: deserialization goes through `BaseEstimator.__setstate__`, which only pops `_sklearn_version`, warns on version mismatch, then updates `__dict__`; no dynamic execution.

**What they do:** Stores HGB tree arrays and runs fast prediction / partial dependence kernels.

**Safety review:** Safe: custom `__setstate__` only restores state and casts nodes to `PREDICTOR_RECORD_DTYPE` for cross-bitness compatibility; no import/eval/callback behavior.
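The restore-and-recast pattern described above can be sketched like this. `TinyPredictor` and `NODE_DTYPE` are toy stand-ins for the real predictor class and `PREDICTOR_RECORD_DTYPE`:

```python
import numpy as np

# Toy record dtype standing in for sklearn's PREDICTOR_RECORD_DTYPE.
NODE_DTYPE = np.dtype([("value", np.float64), ("count", np.int64)])

class TinyPredictor:
    """Sketch of the pattern: __setstate__ only restores attributes and
    re-casts the node array to a canonical dtype; it performs no imports
    and no calls on attacker-controlled objects."""

    def __init__(self, nodes):
        self.nodes = np.asarray(nodes, dtype=NODE_DTYPE)

    def __setstate__(self, state):
        self.__dict__.update(state)
        # The cast guards against cross-platform differences, e.g. an
        # int32 field written on a 32-bit machine.
        self.nodes = self.nodes.astype(NODE_DTYPE)

# Simulate loading state serialized with a narrower integer field.
p = TinyPredictor.__new__(TinyPredictor)
nodes32 = np.array([(1.5, 3)],
                   dtype=[("value", np.float64), ("count", np.int32)])
p.__setstate__({"nodes": nodes32})
print(p.nodes.dtype == NODE_DTYPE)  # True
```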