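Choose appropriate output of a classifier
A scikit-learn classifier usually returns a matrix of probabilities. By default, sklearn-onnx converts that matrix into a list of dictionaries in which each probability is mapped to its class ID or name. This mechanism retains the class names but is slower. Let's see what other options are available.
Train a model and convert it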
from timeit import repeat
import numpy
import sklearn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import onnxruntime as rt
import onnx
import skl2onnx
from skl2onnx.common.data_types import FloatTensorType
from skl2onnx import to_onnx
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
iris = load_iris()
X, y = iris.data, iris.target
X = X.astype(numpy.float32)
y = y * 2 + 10 # to get labels different from [0, 1, 2]
X_train, X_test, y_train, y_test = train_test_split(X, y)
clr = LogisticRegression(max_iter=500)
clr.fit(X_train, y_train)
print(clr)
onx = to_onnx(clr, X_train, target_opset=12)
LogisticRegression(max_iter=500)
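Default behaviour: zipmap=True
The probabilities are returned as a list of dictionaries.
sess = rt.InferenceSession(
    onx.SerializeToString(), providers=["CPUExecutionProvider"]
)
res = sess.run(None, {"X": X_test})
print(res[1][:2])
print("probabilities type:", type(res[1]))
print("type for the first observations:", type(res[1][0]))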
[{10: 0.9532986879348755, 12: 0.046700991690158844, 14: 2.392355042957206e-07}, {10: 9.972082715137276e-09, 12: 0.0012002107687294483, 14: 0.9987998008728027}]
probabilities type: <class 'list'>
type for the first observations: <class 'dict'>
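To make the cost of this format concrete, here is a minimal pure-Python sketch of what the ZipMap step amounts to conceptually: pairing every row of the probability matrix with the class labels. The helper name `zip_map` is hypothetical, the values are rounded from the output above, and this is not how onnxruntime implements the operator.

```python
def zip_map(class_labels, probabilities):
    """Pair each row of probabilities with the class labels (illustration only)."""
    return [dict(zip(class_labels, row)) for row in probabilities]


# Values rounded from the output above.
labels = [10, 12, 14]
probas = [
    [0.9533, 0.0467, 0.0000],
    [0.0000, 0.0012, 0.9988],
]
rows = zip_map(labels, probas)
print(rows[0])
```

One dictionary is built per observation and the keys are repeated in each of them, which is why this format costs more than a plain matrix.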
Option zipmap=False
The probabilities are now a matrix.
initial_type = [("float_input", FloatTensorType([None, 4]))]
options = {id(clr): {"zipmap": False}}
onx2 = to_onnx(clr, X_train, options=options, target_opset=12)
sess2 = rt.InferenceSession(
    onx2.SerializeToString(), providers=["CPUExecutionProvider"]
)
res2 = sess2.run(None, {"X": X_test})
print(res2[1][:2])
print("probabilities type:", type(res2[1]))
print("type for the first observations:", type(res2[1][0]))
[[9.5329869e-01 4.6700992e-02 2.3923550e-07]
[9.9720827e-09 1.2002108e-03 9.9879980e-01]]
probabilities type: <class 'numpy.ndarray'>
type for the first observations: <class 'numpy.ndarray'>
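Option zipmap='columns'
This option removes the final ZipMap operator and splits the probabilities into columns. The resulting model produces one output for the label and one output per class.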
options = {id(clr): {"zipmap": "columns"}}
onx3 = to_onnx(clr, X_train, options=options, target_opset=12)
sess3 = rt.InferenceSession(
    onx3.SerializeToString(), providers=["CPUExecutionProvider"]
)
res3 = sess3.run(None, {"X": X_test})
for i, out in enumerate(sess3.get_outputs()):
    print(
        "output: '{}' shape={} values={}...".format(
            out.name, res3[i].shape, res3[i][:2]
        )
    )
output: 'output_label' shape=(38,) values=[10 14]...
output: 'i10' shape=(38,) values=[9.532987e-01 9.972083e-09]...
output: 'i12' shape=(38,) values=[0.04670099 0.00120021]...
output: 'i14' shape=(38,) values=[2.392355e-07 9.987998e-01]...
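If the matrix form is needed again downstream, the per-class outputs can be stacked back together. A minimal sketch, using only the first two values of each output shown above; the `columns` and `matrix` names are illustrative and not part of sklearn-onnx.

```python
# Per-class outputs, copied from the output above (one sequence per class).
columns = {
    "i10": [9.532987e-01, 9.972083e-09],
    "i12": [0.04670099, 0.00120021],
    "i14": [2.392355e-07, 9.987998e-01],
}

# Transpose the columns to recover one row of probabilities per observation.
matrix = [list(row) for row in zip(*columns.values())]
print(matrix)
```

The column order follows the order in which the outputs are listed, so it has to be kept alongside the matrix if the class names matter later.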
Let's compare prediction time
print("Average time with ZipMap:")
print(sum(repeat(lambda: sess.run(None, {"X": X_test}), number=100, repeat=10)) / 10)
print("Average time without ZipMap:")
print(sum(repeat(lambda: sess2.run(None, {"X": X_test}), number=100, repeat=10)) / 10)
print("Average time without ZipMap but with columns:")
print(sum(repeat(lambda: sess3.run(None, {"X": X_test}), number=100, repeat=10)) / 10)
# On this example, prediction is much faster without ZipMap.
# The gain is even larger when the classes are strings
# rather than integers, because the final result (a list of
# dictionaries) may copy the same information many times
# with onnxruntime.
Average time with ZipMap:
0.006968360000064422
Average time without ZipMap:
0.0026670199999898614
Average time without ZipMap but with columns:
0.004141249999884166
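Each figure above is the time taken by 100 calls to `run`, averaged over 10 repetitions, not the time of a single call. The sketch below wraps that measurement in a helper and adds a per-call variant. The names `average_time` and `time_per_call` are hypothetical, and a trivial stand-in workload replaces the ONNX sessions so that the block runs on its own.

```python
from timeit import repeat


def average_time(fn, number=100, n_repeat=10):
    """Average, over n_repeat runs, of the time taken by `number` calls to fn."""
    return sum(repeat(fn, number=number, repeat=n_repeat)) / n_repeat


def time_per_call(fn, number=100, n_repeat=10):
    """Same measurement divided by `number` to estimate a single call."""
    return average_time(fn, number, n_repeat) / number


# Stand-in workload; with the sessions above this would be
# lambda: sess2.run(None, {"X": X_test}).
def workload():
    return sum(range(100))


batch = average_time(workload)
single = time_per_call(workload)
print(batch, single)
```

Timings of this size vary from run to run, so the ratio between two options is more informative than any absolute value.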
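Option zipmap=False and output_class_labels=True
Option zipmap=False looks like the better choice because it is much faster, but the labels are lost in the process. Option output_class_labels can be used to expose the labels as a third output.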
initial_type = [("float_input", FloatTensorType([None, 4]))]
options = {id(clr): {"zipmap": False, "output_class_labels": True}}
onx4 = to_onnx(clr, X_train, options=options, target_opset=12)
sess4 = rt.InferenceSession(
    onx4.SerializeToString(), providers=["CPUExecutionProvider"]
)
res4 = sess4.run(None, {"X": X_test})
print(res4[1][:2])
print("probabilities type:", type(res4[1]))
print("class labels:", res4[2])
[[9.5329869e-01 4.6700992e-02 2.3923550e-07]
[9.9720827e-09 1.2002108e-03 9.9879980e-01]]
probabilities type: <class 'numpy.ndarray'>
class labels: [10 12 14]
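With the class labels available as a separate output, the predicted label of each observation can be recovered from the matrix without building any dictionary. A minimal sketch in plain Python, using the values printed above; `best_label` is a hypothetical helper, not part of sklearn-onnx or onnxruntime.

```python
# Values copied from the output above.
probabilities = [
    [9.5329869e-01, 4.6700992e-02, 2.3923550e-07],
    [9.9720827e-09, 1.2002108e-03, 9.9879980e-01],
]
class_labels = [10, 12, 14]


def best_label(row, labels):
    """Return the label whose probability is the highest in this row."""
    position = max(range(len(row)), key=row.__getitem__)
    return labels[position]


predicted = [best_label(row, class_labels) for row in probabilities]
print(predicted)
```

The labels are stored once for the whole batch instead of once per observation, which keeps the speed of the matrix output while preserving the class names.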
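Processing time.
print("Average time without ZipMap but with output_class_labels:")
print(sum(repeat(lambda: sess4.run(None, {"X": X_test}), number=100, repeat=10)) / 10)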
Average time without ZipMap but with output_class_labels:
0.003581729999950767
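MultiOutputClassifier
This model is equivalent to several classifiers, one for each label to predict. Instead of returning a matrix of probabilities, it returns a sequence of matrices. Let's first modify the labels to get a problem for a MultiOutputClassifier. One way to do this, consistent with the output below, is to add a second column equal to y + 100 and set every fifth value of that column to 1000 to create a fourth class:
y = numpy.vstack([y, y + 100]).T
y[::5, 1] = 1000  # every fifth row gets a fourth class in the second target
print(y[:5])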
[[ 10 1000]
[ 10 110]
[ 10 110]
[ 10 110]
[ 10 110]]
Let's train a MultiOutputClassifier.
X_train, X_test, y_train, y_test = train_test_split(X, y)
clr = MultiOutputClassifier(LogisticRegression(max_iter=500))
clr.fit(X_train, y_train)
print(clr)
onx5 = to_onnx(clr, X_train, target_opset=12)
sess5 = rt.InferenceSession(
    onx5.SerializeToString(), providers=["CPUExecutionProvider"]
)
res5 = sess5.run(None, {"X": X_test[:3]})
print(res5)
MultiOutputClassifier(estimator=LogisticRegression(max_iter=500))
/home/xadupre/github/sklearn-onnx/skl2onnx/_parse.py:551: UserWarning: Option zipmap is ignored for model <class 'sklearn.multioutput.MultiOutputClassifier'>. Set option zipmap to False to remove this message.
warnings.warn(
[array([[ 14, 114],
[ 12, 112],
[ 12, 112]], dtype=int64), [array([[1.5121835e-04, 1.6296931e-01, 8.3687949e-01],
[7.3818588e-03, 7.9895413e-01, 1.9366404e-01],
[4.2174147e-03, 8.5948825e-01, 1.3629435e-01]], dtype=float32), array([[4.0355229e-04, 1.9043219e-01, 5.6093395e-01, 2.4823032e-01],
[5.2199918e-03, 4.4712257e-01, 1.9140574e-01, 3.5625172e-01],
[3.1568978e-03, 5.7498026e-01, 1.5389088e-01, 2.6797205e-01]],
dtype=float32)]]
Option zipmap is ignored. The labels are missing, but they can be added back as a third output.
onx6 = to_onnx(
    clr,
    X_train,
    target_opset=12,
    options={"zipmap": False, "output_class_labels": True},
)
sess6 = rt.InferenceSession(
    onx6.SerializeToString(), providers=["CPUExecutionProvider"]
)
res6 = sess6.run(None, {"X": X_test[:3]})
print("predicted labels", res6[0])
print("predicted probabilities", res6[1])
print("class labels", res6[2])
predicted labels [[ 14 114]
[ 12 112]
[ 12 112]]
predicted probabilities [array([[1.5121835e-04, 1.6296931e-01, 8.3687949e-01],
[7.3818588e-03, 7.9895413e-01, 1.9366404e-01],
[4.2174147e-03, 8.5948825e-01, 1.3629435e-01]], dtype=float32), array([[4.0355229e-04, 1.9043219e-01, 5.6093395e-01, 2.4823032e-01],
[5.2199918e-03, 4.4712257e-01, 1.9140574e-01, 3.5625172e-01],
[3.1568978e-03, 5.7498026e-01, 1.5389088e-01, 2.6797205e-01]],
dtype=float32)]
class labels [array([10, 12, 14], dtype=int64), array([ 110, 112, 114, 1000], dtype=int64)]
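Because each target has its own probability matrix and its own array of class labels, decoding has to be done target by target. The sketch below does this in plain Python with the values printed above and recovers the predicted labels; the variable names are illustrative only.

```python
# Values copied from the output above: one probability matrix and one
# list of class labels per target.
probabilities = [
    [
        [1.5121835e-04, 1.6296931e-01, 8.3687949e-01],
        [7.3818588e-03, 7.9895413e-01, 1.9366404e-01],
        [4.2174147e-03, 8.5948825e-01, 1.3629435e-01],
    ],
    [
        [4.0355229e-04, 1.9043219e-01, 5.6093395e-01, 2.4823032e-01],
        [5.2199918e-03, 4.4712257e-01, 1.9140574e-01, 3.5625172e-01],
        [3.1568978e-03, 5.7498026e-01, 1.5389088e-01, 2.6797205e-01],
    ],
]
class_labels = [[10, 12, 14], [110, 112, 114, 1000]]

n_rows = len(probabilities[0])
decoded = []
for i in range(n_rows):
    row_labels = []
    # Pair each target's labels with that target's probability matrix.
    for labels, matrix in zip(class_labels, probabilities):
        row = matrix[i]
        position = max(range(len(row)), key=row.__getitem__)
        row_labels.append(labels[position])
    decoded.append(row_labels)
print(decoded)
```

The decoded values agree with the predicted labels returned as the first output, and the two targets are free to have different numbers of classes (three and four here).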
Versions used for this example
print("numpy:", numpy.__version__)
print("scikit-learn:", sklearn.__version__)
print("onnx: ", onnx.__version__)
print("onnxruntime: ", rt.__version__)
print("skl2onnx: ", skl2onnx.__version__)
numpy: 1.23.5
scikit-learn: 1.4.dev0
onnx: 1.15.0
onnxruntime: 1.16.0+cu118
skl2onnx: 1.16.0
Total running time of the script: (0 minutes 0.467 seconds)