Tricky issue when converting CountVectorizer or TfidfVectorizer
This issue is described at scikit-learn/issues/13733. If a CountVectorizer or TfidfVectorizer produces a token containing a space, skl2onnx cannot know whether it is a bigram or a unigram with a space.
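The ambiguity can be sketched in a few lines (illustrative only; the names below are not part of the original example):

```python
# Illustrative sketch: the same vocabulary string can come from
# two different tokenizations.
as_bigram = ("first", "document")    # two ordinary tokens
as_unigram = ("first document",)     # one token matched by a pattern allowing spaces

# scikit-learn stores word n-grams joined with a space, so both
# collapse to the same vocabulary key:
assert " ".join(as_bigram) == " ".join(as_unigram) == "first document"
```

Once both readings map to the same string key, the converter has no way to tell them apart.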
A simple example that cannot be converted
import pprint
import numpy
from numpy.testing import assert_almost_equal
from onnxruntime import InferenceSession
from sklearn.feature_extraction.text import TfidfVectorizer
from skl2onnx import to_onnx
from skl2onnx.sklapi import TraceableTfidfVectorizer
import skl2onnx.sklapi.register # noqa
corpus = numpy.array(
    [
        "This is the first document.",
        "This document is the second document.",
        "Is this the first document?",
        "",
    ]
).reshape((4,))
pattern = r"\b[a-z ]{1,10}\b"
mod1 = TfidfVectorizer(ngram_range=(1, 2), token_pattern=pattern)
mod1.fit(corpus)
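With this token pattern, scikit-learn's tokenizer (which applies `re.findall` to the lowercased text) produces tokens that keep a trailing space, which is what later makes the n-grams ambiguous. A quick sketch:

```python
import re

# The same pattern as above; the character class [a-z ] lets a
# matched token contain spaces.
pattern = r"\b[a-z ]{1,10}\b"
tokens = re.findall(pattern, "this is the first document.")
print(tokens)  # ['this is ', 'the first ', 'document']
```

These tokens match the space-carrying entries visible in the vocabulary below.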
Unigrams and bigrams are placed into the following container, which maps them to their column index.
{'document': 0,
'document ': 1,
'document is the ': 2,
'is the ': 3,
'is the second ': 4,
'is this ': 5,
'is this the first ': 6,
'second ': 7,
'second document': 8,
'the first ': 9,
'the first document': 10,
'this ': 11,
'this document ': 12,
'this is ': 13,
'this is the first ': 14}
Conversion.
try:
    to_onnx(mod1, corpus)
except RuntimeError as e:
    print(e)
There were ambiguities between n-grams and tokens. 2 errors occurred. You can fix it by using class TraceableTfidfVectorizer.
You can learn more at https://github.com/scikit-learn/scikit-learn/issues/13733.
Unable to split n-grams 'is this the first ' into tokens ('is', 'this', 'the', 'first ') existing in the vocabulary. Token 'is' does not exist in the vocabulary..
Unable to split n-grams 'this is the first ' into tokens ('this', 'is', 'the', 'first ') existing in the vocabulary. Token 'this' does not exist in the vocabulary..
TraceableTfidfVectorizer
Class TraceableTfidfVectorizer is equivalent to sklearn.feature_extraction.text.TfidfVectorizer but stores the unigrams and bigrams of the vocabulary as tuples instead of concatenating every piece into a single string.
mod2 = TraceableTfidfVectorizer(ngram_range=(1, 2), token_pattern=pattern)
mod2.fit(corpus)
pprint.pprint(mod2.vocabulary_)
{('document',): 0,
('document ',): 1,
('document ', 'is the '): 2,
('is the ',): 3,
('is the ', 'second '): 4,
('is this ',): 5,
('is this ', 'the first '): 6,
('second ',): 7,
('second ', 'document'): 8,
('the first ',): 9,
('the first ', 'document'): 10,
('this ',): 11,
('this ', 'document '): 12,
('this is ',): 13,
('this is ', 'the first '): 14}
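With tuple keys, the two readings of a space-containing string stay distinct. A minimal sketch with hypothetical entries (this is not `mod2.vocabulary_` itself):

```python
# Illustrative dictionary: tuple keys keep a space-containing unigram
# separate from a genuine bigram, even though both would join to the
# same string.
vocab = {
    ("is the ",): 3,        # one token that happens to contain spaces
    ("is ", "the "): 4,     # two tokens forming a bigram (hypothetical entry)
}
assert ("is the ",) in vocab
assert ("is ", "the ") in vocab
assert vocab[("is the ",)] != vocab[("is ", "the ")]
```

This is exactly the information the converter needs to split every n-gram unambiguously.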
Let's check it produces the same results.
assert_almost_equal(mod1.transform(corpus).todense(), mod2.transform(corpus).todense())
Conversion. The line import skl2onnx.sklapi.register was added to register the converters associated with these new classes. By default, only the converters for scikit-learn models are declared.
onx = to_onnx(mod2, corpus)
sess = InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])
got = sess.run(None, {"X": corpus})
Let's check there are no discrepancies...
assert_almost_equal(mod2.transform(corpus).todense(), got[0])
Total running time of the script: (0 minutes 0.037 seconds)