One model, many possible conversions with options

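There is more than one way to convert a model. A newer version of ONNX may add an operator that speeds up the converted model. Using that operator is the rational choice, but it assumes the target runtime implements it. What if two users need two different conversions of the same model? Let's see how this can be done.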

Option zipmap

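Every classifier is by design converted into an ONNX graph that outputs two results: the predicted label and the predicted probability of every label. By default, the labels are integers and the probabilities are stored in dictionaries. That is the purpose of the operator *ZipMap* added at the end of the following graph.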

    graph ONNX(LogisticRegression) (
      %X[FLOAT, ?x4]
    ) {
      %label, %probability_tensor = LinearClassifier[classlabels_ints = [0, 1, 2], coefficients = [-0.374590873718262, 0.882017612457275, -2.25903177261353, -0.96484386920929, 0.463038802146912, -0.698963463306427, -0.0836651995778084, -0.888288736343384, -0.0884479135274887, -0.18305416405201, 2.34269690513611, 1.85313260555267], intercepts = [8.58371162414551, 2.95640826225281, -11.5401201248169], multi_class = 1, post_transform = 'SOFTMAX'](%X)
      %output_label = Cast[to = 7](%label)
      %probabilities = Normalizer[norm = 'L1'](%probability_tensor)
      %output_probability = ZipMap[classlabels_int64s = [0, 1, 2]](%probabilities)
      return %output_label, %output_probability
    }

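This operator is not efficient because it copies every probability and label into a different container. For small classifiers, the time spent on this copy is usually significant, so it makes sense to remove the operator.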

    graph ONNX(LogisticRegression) (
      %X[FLOAT, ?x4]
    ) {
      %label, %probability_tensor = LinearClassifier[classlabels_ints = [0, 1, 2], coefficients = [-0.374590873718262, 0.882017612457275, -2.25903177261353, -0.96484386920929, 0.463038802146912, -0.698963463306427, -0.0836651995778084, -0.888288736343384, -0.0884479135274887, -0.18305416405201, 2.34269690513611, 1.85313260555267], intercepts = [8.58371162414551, 2.95640826225281, -11.5401201248169], multi_class = 1, post_transform = 'SOFTMAX'](%X)
      %probabilities = Normalizer[norm = 'L1'](%probability_tensor)
      return %label, %probabilities
    }
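
To make the cost concrete, here is a minimal pure-Python sketch of what *ZipMap* does conceptually: it pairs every row of the probability tensor with the class labels and builds one dictionary per sample. The helper name `zipmap` and the sample values are illustrative and are not part of any ONNX runtime.

```python
def zipmap(class_labels, probabilities):
    """Build one {label: probability} dictionary per row (illustrative helper)."""
    return [dict(zip(class_labels, row)) for row in probabilities]


# Two samples, three classes, stored as a plain list of rows.
probabilities = [
    [0.88, 0.11, 0.01],
    [0.79, 0.20, 0.01],
]
as_dicts = zipmap([0, 1, 2], probabilities)
print(as_dicts[0])
```

Every value is copied into a new dictionary. That copy is the overhead that disappears when the probabilities are returned as a single tensor.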

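A graph may contain many classifiers, so there must be a way to specify which classifier keeps its *ZipMap* and which does not. For that reason, options can be specified by id.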

from pprint import pformat
import numpy
from onnx.reference import ReferenceEvaluator
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from skl2onnx.common._registration import _converter_pool
from skl2onnx import to_onnx
from onnxruntime import InferenceSession

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=11)
clr = LogisticRegression()
clr.fit(X_train, y_train)

model_def = to_onnx(
    clr, X_train.astype(numpy.float32), options={id(clr): {"zipmap": False}}
)
oinf = ReferenceEvaluator(model_def)
print(oinf)
ReferenceEvaluator(X) -> label, probabilities

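Using the function *id* has one drawback: the value does not survive pickling, because an unpickled model is a new object with a different id. It is better to use strings.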

model_def = to_onnx(clr, X_train.astype(numpy.float32), options={"zipmap": False})
oinf = ReferenceEvaluator(model_def)
print(oinf)
ReferenceEvaluator(X) -> label, probabilities
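
The drawback of id-based keys can be reproduced without any ONNX dependency. In the sketch below, the dictionary `model` is a hypothetical stand-in for a fitted estimator. An options dictionary keyed by `id` no longer matches the object after a pickle round trip, whereas a string key is unaffected.

```python
import pickle

# Hypothetical stand-in for a fitted estimator.
model = {"name": "clr", "coefficients": [0.1, 0.2]}

options_by_id = {id(model): {"zipmap": False}}
options_by_name = {"zipmap": False}

# The unpickled copy is a distinct object, so its id differs
# from the id of the original while both are alive.
clone = pickle.loads(pickle.dumps(model))

print(id(model) in options_by_id)   # the original object is found
print(id(clone) in options_by_id)   # the restored copy is not
print("zipmap" in options_by_name)  # string keys do not depend on the object
```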

Options in a pipeline

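In a pipeline, sklearn-onnx uses the same naming convention: the option name is prefixed with the step name and a double underscore.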

pipe = Pipeline([("norm", MinMaxScaler()), ("clr", LogisticRegression())])
pipe.fit(X_train, y_train)

model_def = to_onnx(pipe, X_train.astype(numpy.float32), options={"clr__zipmap": False})
oinf = ReferenceEvaluator(model_def)
print(oinf)
ReferenceEvaluator(X) -> label, probabilities
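
The naming convention can be illustrated with a small helper that splits a key such as `clr__zipmap` into a step name and an option name. This is only a sketch of the convention: `split_option_key` is a hypothetical function and not how sklearn-onnx resolves options internally.

```python
def split_option_key(key):
    """Split 'step__option' into (step, option); step is None for a global key."""
    step, sep, option = key.rpartition("__")
    if not sep:
        # No double underscore: the option is not tied to a pipeline step.
        return None, key
    return step, option


options = {"clr__zipmap": False, "clr__raw_scores": True, "zipmap": True}
for key, value in options.items():
    print(split_option_key(key), value)
```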

Option raw_scores

Every classifier is converted into a graph that returns probabilities by default. But many models compute unscaled *raw_scores*. First, with probabilities:

pipe = Pipeline([("norm", MinMaxScaler()), ("clr", LogisticRegression())])
pipe.fit(X_train, y_train)

model_def = to_onnx(
    pipe, X_train.astype(numpy.float32), options={id(pipe): {"zipmap": False}}
)

oinf = ReferenceEvaluator(model_def)
print(oinf.run(None, {"X": X.astype(numpy.float32)[:5]}))
[array([0, 0, 0, 0, 0]), array([[0.88268626, 0.10948393, 0.00782984],
       [0.7944385 , 0.19728662, 0.00827491],
       [0.85557765, 0.13792053, 0.00650185],
       [0.8262804 , 0.16634221, 0.00737737],
       [0.90050155, 0.092388  , 0.00711049]], dtype=float32)]

Then with raw scores:

model_def = to_onnx(
    pipe,
    X_train.astype(numpy.float32),
    options={id(pipe): {"raw_scores": True, "zipmap": False}},
)

oinf = ReferenceEvaluator(model_def)
print(oinf.run(None, {"X": X.astype(numpy.float32)[:5]}))
[array([0, 0, 0, 0, 0]), array([[0.88268626, 0.10948393, 0.00782984],
       [0.7944385 , 0.19728662, 0.00827491],
       [0.85557765, 0.13792053, 0.00650185],
       [0.8262804 , 0.16634221, 0.00737737],
       [0.90050155, 0.092388  , 0.00711049]], dtype=float32)]

It does not seem to work. We need to say that the option applies to a specific part of the pipeline and not to the whole pipeline.

model_def = to_onnx(
    pipe,
    X_train.astype(numpy.float32),
    options={id(pipe.steps[1][1]): {"raw_scores": True, "zipmap": False}},
)

oinf = ReferenceEvaluator(model_def)
print(oinf.run(None, {"X": X.astype(numpy.float32)[:5]}))
[array([0, 0, 0, 0, 0]), array([[ 2.2707398 ,  0.18354762, -2.4542873 ],
       [ 1.9857951 ,  0.5928172 , -2.5786123 ],
       [ 2.2349296 ,  0.4098304 , -2.6447601 ],
       [ 2.1071343 ,  0.5042473 , -2.6113818 ],
       [ 2.3727787 ,  0.095824  , -2.4686027 ]], dtype=float32)]

There are negative values, so it works. Strings are still easier to use.

model_def = to_onnx(
    pipe,
    X_train.astype(numpy.float32),
    options={"clr__raw_scores": True, "clr__zipmap": False},
)

oinf = ReferenceEvaluator(model_def)
print(oinf.run(None, {"X": X.astype(numpy.float32)[:5]}))
[array([0, 0, 0, 0, 0]), array([[ 2.2707398 ,  0.18354762, -2.4542873 ],
       [ 1.9857951 ,  0.5928172 , -2.5786123 ],
       [ 2.2349296 ,  0.4098304 , -2.6447601 ],
       [ 2.1071343 ,  0.5042473 , -2.6113818 ],
       [ 2.3727787 ,  0.095824  , -2.4686027 ]], dtype=float32)]

Negative values again: we still get raw scores.
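
The two kinds of output are consistent with each other. The logistic regression graph shown earlier applies a `SOFTMAX` post-transform, so, assuming the classifier in the pipeline does the same, applying a softmax to the raw scores of the first sample should recover the probabilities printed earlier. The sketch below checks this with the standard library only. The numbers are copied from the outputs in this section.

```python
import math


def softmax(scores):
    """Numerically stable softmax of a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [2.2707398, 0.18354762, -2.4542873]  # first row with raw_scores=True
expected = [0.88268626, 0.10948393, 0.00782984]   # first row of the probabilities

probabilities = softmax(raw_scores)
print(probabilities)
print(all(abs(p - e) < 1e-4 for p, e in zip(probabilities, expected)))
```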

Option decision_path

*scikit-learn* implements a function to retrieve the decision path. It can be enabled with the option *decision_path*.

clrrf = RandomForestClassifier(n_estimators=2, max_depth=2)
clrrf.fit(X_train, y_train)
clrrf.predict(X_test[:2])
paths, n_nodes_ptr = clrrf.decision_path(X_test[:2])
print(paths.todense())

model_def = to_onnx(
    clrrf,
    X_train.astype(numpy.float32),
    options={id(clrrf): {"decision_path": True, "zipmap": False}},
)
sess = InferenceSession(
    model_def.SerializeToString(), providers=["CPUExecutionProvider"]
)
[[1 0 0 0 1 0 1 1 1 0 1 0 0 0]
 [1 0 0 0 1 0 1 1 1 0 1 0 0 0]]

The model produces 3 outputs.

print([o.name for o in sess.get_outputs()])
['label', 'probabilities', 'decision_path']

Let's display the last one.

res = sess.run(None, {"X": X_test[:2].astype(numpy.float32)})
print(res[-1])
[['1000101' '1101000']
 ['1000101' '1101000']]
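
Comparing this output with the dense matrix printed by scikit-learn suggests that each string encodes one tree, with one character per node. Under that assumption (inferred from the outputs in this section, not from a documented guarantee), concatenating the strings of a sample gives back the scikit-learn row. `onnx_paths_to_row` is a hypothetical helper.

```python
def onnx_paths_to_row(tree_paths):
    """Concatenate per-tree path strings into one list of 0/1 node indicators."""
    return [int(flag) for path in tree_paths for flag in path]


onnx_row = ["1000101", "1101000"]  # first sample of the ONNX decision_path output
sklearn_row = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0]  # first dense row

print(onnx_paths_to_row(onnx_row) == sklearn_row)
```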

List of available options

Options are registered for every converter so that the conversion can check the options it receives against the supported ones.

all_opts = set()
for k, v in sorted(_converter_pool.items()):
    opts = v.get_allowed_options()
    if not isinstance(opts, dict):
        continue
    name = k.replace("Sklearn", "")
    print("%s%s %r" % (name, " " * (30 - len(name)), opts))
    for o in opts:
        all_opts.add(o)

print("all options:", pformat(list(sorted(all_opts))))
AdaBoostClassifier             {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
BaggingClassifier              {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
BayesianGaussianMixture        {'score_samples': [True, False]}
BayesianRidge                  {'return_std': [True, False]}
BernoulliNB                    {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
CalibratedClassifierCV         {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
CategoricalNB                  {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
ComplementNB                   {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
CountVectorizer                {'tokenexp': None, 'separators': None, 'nan': [True, False], 'keep_empty_string': [True, False], 'locale': None}
DecisionTreeClassifier         {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'decision_path': [True, False], 'decision_leaf': [True, False]}
DecisionTreeRegressor          {'decision_path': [True, False], 'decision_leaf': [True, False]}
ExtraTreeClassifier            {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'decision_path': [True, False], 'decision_leaf': [True, False]}
ExtraTreeRegressor             {'decision_path': [True, False], 'decision_leaf': [True, False]}
ExtraTreesClassifier           {'zipmap': [True, False, 'columns'], 'raw_scores': [True, False], 'nocl': [True, False], 'output_class_labels': [False, True], 'decision_path': [True, False], 'decision_leaf': [True, False]}
ExtraTreesRegressor            {'decision_path': [True, False], 'decision_leaf': [True, False]}
GaussianMixture                {'score_samples': [True, False]}
GaussianNB                     {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
GaussianProcessClassifier      {'optim': [None, 'cdist'], 'nocl': [False, True], 'output_class_labels': [False, True], 'zipmap': [False, True]}
GaussianProcessRegressor       {'return_cov': [False, True], 'return_std': [False, True], 'optim': [None, 'cdist']}
GradientBoostingClassifier     {'zipmap': [True, False, 'columns'], 'raw_scores': [True, False], 'output_class_labels': [False, True], 'nocl': [True, False]}
HistGradientBoostingClassifier {'zipmap': [True, False, 'columns'], 'raw_scores': [True, False], 'output_class_labels': [False, True], 'nocl': [True, False]}
HistGradientBoostingRegressor  {'zipmap': [True, False, 'columns'], 'raw_scores': [True, False], 'output_class_labels': [False, True], 'nocl': [True, False]}
IsolationForest                {'score_samples': [True, False]}
KMeans                         {'gemm': [True, False]}
KNNImputer                     {'optim': [None, 'cdist']}
KNeighborsClassifier           {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'raw_scores': [True, False], 'output_class_labels': [False, True], 'optim': [None, 'cdist']}
KNeighborsRegressor            {'optim': [None, 'cdist']}
KNeighborsTransformer          {'optim': [None, 'cdist']}
KernelPCA                      {'optim': [None, 'cdist']}
LinearClassifier               {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
LinearSVC                      {'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
LocalOutlierFactor             {'score_samples': [True, False], 'optim': [None, 'cdist']}
MLPClassifier                  {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
MaxAbsScaler                   {'div': ['std', 'div', 'div_cast']}
MiniBatchKMeans                {'gemm': [True, False]}
MultiOutputClassifier          {'nocl': [False, True], 'output_class_labels': [False, True], 'zipmap': [False, True]}
MultinomialNB                  {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
NearestNeighbors               {'optim': [None, 'cdist']}
OneVsOneClassifier             {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True]}
OneVsRestClassifier            {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
QuadraticDiscriminantAnalysis  {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True]}
RadiusNeighborsClassifier      {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'raw_scores': [True, False], 'output_class_labels': [False, True], 'optim': [None, 'cdist']}
RadiusNeighborsRegressor       {'optim': [None, 'cdist']}
RandomForestClassifier         {'zipmap': [True, False, 'columns'], 'raw_scores': [True, False], 'nocl': [True, False], 'output_class_labels': [False, True], 'decision_path': [True, False], 'decision_leaf': [True, False]}
RandomForestRegressor          {'decision_path': [True, False], 'decision_leaf': [True, False]}
RobustScaler                   {'div': ['std', 'div', 'div_cast']}
SGDClassifier                  {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
SVC                            {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
Scaler                         {'div': ['std', 'div', 'div_cast']}
StackingClassifier             {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
TfidfTransformer               {'nan': [True, False]}
TfidfVectorizer                {'tokenexp': None, 'separators': None, 'nan': [True, False], 'keep_empty_string': [True, False], 'locale': None}
VotingClassifier               {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
_ConstantPredictor             {'zipmap': [True, False, 'columns'], 'nocl': [True, False]}
all options: ['decision_leaf',
 'decision_path',
 'div',
 'gemm',
 'keep_empty_string',
 'locale',
 'nan',
 'nocl',
 'optim',
 'output_class_labels',
 'raw_scores',
 'return_cov',
 'return_std',
 'score_samples',
 'separators',
 'tokenexp',
 'zipmap']
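
A registry of this kind is enough to validate options before converting. The sketch below checks a user-supplied dictionary against the allowed values of one converter. `check_options` is a hypothetical helper, and treating `None` as "any value accepted" is an assumption based on entries such as `'tokenexp': None`. sklearn-onnx performs its own checks during conversion.

```python
def check_options(allowed, options):
    """Return a list of problems found in `options` (empty if all are valid)."""
    problems = []
    for name, value in options.items():
        if name not in allowed:
            problems.append("unknown option %r" % name)
        elif allowed[name] is not None and value not in allowed[name]:
            problems.append("invalid value %r for option %r" % (value, name))
    return problems


# Allowed options copied from the LinearClassifier entry in the listing.
allowed = {
    "zipmap": [True, False, "columns"],
    "nocl": [True, False],
    "output_class_labels": [False, True],
    "raw_scores": [True, False],
}

print(check_options(allowed, {"zipmap": False, "raw_scores": True}))
print(check_options(allowed, {"decision_path": True, "zipmap": "rows"}))
```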
