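Debugging Numerical Errors

When an inference executable compiled by onnx-mlir produces numerical results that are inconsistent with those produced by the training framework, use the utils/RunONNXModel.py Python script to debug the numerical error. This script runs the model through both onnx-mlir and a reference backend, and compares the intermediate results produced by the two backends layer by layer.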

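Prerequisites

Set the ONNX_MLIR_HOME environment variable to the path of the onnx-mlir HOME directory. The onnx-mlir HOME directory is the parent folder that contains the bin, lib, and other subfolders in which the ONNX-MLIR executables and libraries can be found.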
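A quick sanity check of the directory layout described above can catch a wrong ONNX_MLIR_HOME before any model is run. The following is a minimal sketch; the helper name missing_onnx_mlir_dirs is hypothetical, and it only checks for the bin and lib subfolders named in the text.

```python
import os


def missing_onnx_mlir_dirs(home, expected=("bin", "lib")):
    """Return the expected subfolders that do not exist under `home`.

    Hypothetical helper: only the `bin` and `lib` names come from the text.
    """
    return [name for name in expected
            if not os.path.isdir(os.path.join(home, name))]


# Report the state of the environment variable without raising.
home = os.environ.get("ONNX_MLIR_HOME")
if home is None:
    print("ONNX_MLIR_HOME is not set")
else:
    print("missing subfolders:", missing_onnx_mlir_dirs(home) or "none")
```

An empty list means both subfolders were found; it does not prove that the executables inside them are usable.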

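Reference backend

Outputs produced by onnx-mlir can be verified against a reference ONNX backend, or against reference inputs and outputs stored in protobuf.

To verify using a reference backend, install onnxruntime by running pip install onnxruntime. To use a different testing backend, replace the code that imports onnxruntime with another ONNX-compliant backend.
To verify using reference outputs, use --verify=ref --load-ref=data_folder, where data_folder is the path to a folder containing protobuf files for the inputs and outputs. This guide explains how to create protobuf files from numpy arrays.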
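As an illustration of the second option, the sketch below assembles a command line that verifies against protobuf reference data. The function name ref_verify_command and the file names model.onnx and data_folder are placeholders; the flags are taken from the script's help text, which states that a verification mode such as --verify-every-value must accompany --verify.

```python
import shlex


def ref_verify_command(model, data_folder, script="utils/RunONNXModel.py"):
    """Build the argument list for a run verified against reference data.

    Hypothetical helper; it only assembles flags listed in the help text.
    """
    return [
        "python", script,
        "--model", model,
        "--verify=ref",
        "--load-ref=" + data_folder,
        "--verify-every-value",  # a verification mode is required with --verify
    ]


cmd = ref_verify_command("model.onnx", "data_folder")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run as is, which avoids shell quoting issues.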

Usage

utils/RunONNXModel.py supports the following command-line options:

$ python ../utils/RunONNXModel.py  --help
usage: RunONNXModel.py [-h] [--log-to-file [LOG_TO_FILE]] [--model MODEL] [--compile-args COMPILE_ARGS] [--compile-only] [--compile-using-input-shape] [--print-input]
                       [--print-output] [--save-onnx PATH] [--verify {onnxruntime,ref}] [--verify-all-ops] [--verify-with-softmax] [--verify-every-value] [--rtol RTOL]
                       [--atol ATOL] [--save-so PATH | --load-so PATH] [--save-ref PATH] [--load-ref PATH | --shape-info SHAPE_INFO] [--lower-bound LOWER_BOUND]
                       [--upper-bound UPPER_BOUND]

optional arguments:
  -h, --help                  show this help message and exit
  --log-to-file [LOG_TO_FILE] Output compilation messages to file, default compilation.log
  --model MODEL               Path to an ONNX model (.onnx or .mlir)
  --compile-args COMPILE_ARGS Arguments passed directly to onnx-mlir command. See bin/onnx-mlir --help
  --compile-only              Only compile the input model
  --compile-using-input-shape Compile the model by using the shape info getting from the inputs in the reference folder set by --load-ref
  --print-input               Print out inputs
  --print-output              Print out inference outputs produced by onnx-mlir
  --save-onnx PATH            File path to save the onnx model. Only effective if --verify=onnxruntime
  --verify {onnxruntime,ref}  Verify the output by using onnxruntime or reference inputs/outputs. By default, no verification. When being enabled, --verify-with-softmax or --verify-every-value must be used to specify verification mode.
  --verify-all-ops            Verify all operation outputs when using onnxruntime
  --verify-with-softmax       Verify the result obtained by applying softmax to the output
  --verify-every-value        Verify every value of the output using atol and rtol
  --rtol RTOL                 Relative tolerance for verification
  --atol ATOL                 Absolute tolerance for verification
  --save-so PATH              File path to save the generated shared library of the model
  --load-so PATH              File path to load a generated shared library for inference, and the ONNX model will not be re-compiled
  --save-ref PATH             Path to a folder to save the inputs and outputs in protobuf
  --load-ref PATH             Path to a folder containing reference inputs and outputs stored in protobuf. If --verify=ref, inputs and outputs are reference data for verification
  --shape-info SHAPE_INFO     Shape for each dynamic input of the model, e.g. 0:1x10x20,1:7x5x3. Used to generate random inputs for the model if --load-ref is not set
  --lower-bound LOWER_BOUND   Lower bound values for each data type. Used inputs. E.g. --lower-bound=int64:-10,float32:-0.2,uint8:1. Supported types are bool, uint8, int8, uint16, int16, uint32, int32, uint64, int64,float16, float32, float64
  --upper-bound UPPER_BOUND   Upper bound values for each data type. Used to generate random inputs. E.g. --upper-bound=int64:10,float32:0.2,uint8:9. Supported types are bool, uint8, int8, uint16, int16, uint32, int32, uint64, int64, float16, float32, float64
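To make the roles of --rtol and --atol concrete, the sketch below shows an element-wise tolerance check. It assumes the common NumPy-style rule |actual - reference| <= atol + rtol * |reference|; the help text does not state the script's exact formula, so this is illustrative only. The softmax helper reflects one plausible reading of --verify-with-softmax, in which outputs are normalized before being compared. All function names here are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mismatches(actual, reference, rtol, atol):
    """Return indices where |actual - reference| > atol + rtol * |reference|.

    Assumed tolerance rule (NumPy-style), not confirmed by the help text.
    """
    return [
        i for i, (a, r) in enumerate(zip(actual, reference))
        if abs(a - r) > atol + rtol * abs(r)
    ]


out = [1.0, 2.0, 3.5]      # values from the backend under test
ref = [1.0, 2.001, 3.0]    # values from the reference backend

# Compare every value directly.
print(mismatches(out, ref, rtol=0.01, atol=0.0))
# Compare after normalizing both outputs with softmax.
print(mismatches(softmax(out), softmax(ref), rtol=0.01, atol=0.0))
```

Under this rule, rtol scales with the magnitude of the reference value, while atol sets a floor that matters for values close to zero.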

Helper script to compare a model under two different sets of compile options

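Built on top of utils/RunONNXModel.py described above, utils/checkONNXModel.py lets a user run a given model twice, under two different sets of compile options, and compare the results. This makes it easy to test a new option by comparing a safe version of the compiler (e.g. -O0 or -O3) with a more advanced one (e.g. -O3 or -O3 -march=x86-64). Specify the compile options with the --ref-compile-args and --test-compile-args flags and the model with the --model flag, and add --shape-info if the model has inputs with dynamic shapes. The full list of options is available through the --help flag.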
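The sketch below assembles such a comparison run. The helper names format_shape_info and check_command are hypothetical, as is the model file name; the flags and the --shape-info syntax (for example 0:1x10x20,1:7x5x3) are those given in this document. Compile arguments are attached with = because values that begin with a dash, such as -O0, are otherwise likely to be mistaken for separate options by a typical argument parser.

```python
import shlex


def format_shape_info(shapes):
    """Turn {0: (1, 10, 20), 1: (7, 5, 3)} into "0:1x10x20,1:7x5x3"."""
    parts = []
    for index, dims in sorted(shapes.items()):
        dims_text = "x".join(str(d) for d in dims)
        parts.append(f"{index}:{dims_text}")
    return ",".join(parts)


def check_command(model, ref_args, test_args, shapes=None,
                  script="utils/checkONNXModel.py"):
    """Build the argument list for comparing two sets of compile options.

    Hypothetical helper; it only assembles flags named in the text.
    """
    cmd = [
        "python", script,
        "--model", model,
        "--ref-compile-args=" + ref_args,
        "--test-compile-args=" + test_args,
    ]
    if shapes:
        cmd.append("--shape-info=" + format_shape_info(shapes))
    return cmd


cmd = check_command("model.onnx", "-O0", "-O3 -march=x86-64",
                    shapes={0: (1, 10, 20)})
print(shlex.join(cmd))
```

Passing the list to subprocess.run keeps the multi-word compile arguments together as a single value.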

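Debugging the code generated for an operator

If you know or suspect that a particular ONNX MLIR operator produces an incorrect result and want to narrow down the problem, a couple of useful Krnl operators are available that print, at runtime, the value of a tensor or of a value with a primitive data type.

To print the value of a tensor at a particular program point, inject the following code (where X is the tensor to be printed):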

create.krnl.printTensor("Tensor X: ", X);

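Note: currently the content of the tensor is printed only when the tensor's rank is less than four.

To print a message followed by a single value, inject the following code (where val is the value to be printed and valType is its type):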

create.krnl.printf("inputElem: ", val, valType);

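Finding memory errors

If you know or suspect that an inference executable compiled by onnx-mlir suffers from memory allocation issues, the valgrind framework or the mtrace memory tool can help with debugging. These tools trace the APIs related to memory allocation and deallocation, and can detect memory issues such as memory leaks.

However, problems related to memory access, especially buffer overruns, are notoriously difficult to debug because the run-time error occurs away from the point that contains the bug. The Electric Fence library can be used to debug these problems. It helps detect two common programming errors: software that overruns the boundaries of a malloc() memory allocation, and software that touches a memory allocation that has already been released by free(). Unlike other memory debuggers, Electric Fence detects read accesses as well as writes, and it pinpoints the exact instruction that causes the error.

Since the Electric Fence library is not officially supported by RedHat, you need to download, build, and install it from source yourself. After installing it, link the library with the "-lefence" option when generating the inference executable. Then simply run the executable; it will fail with a runtime error and stop at the place that causes the memory access problem. You can identify that place with a debugger or with the debugging print functions described in the previous section.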
