A Short Guide on the Differentiability Tag for ONNX Operators

Differentiability Tag

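The ONNX operator schema for each operator includes a differentiability tag for each input and output. This document explains what the tag means and how to ensure that it is correct. Briefly, the tag identifies the set of differentiable inputs and differentiable outputs of an operator. Its meaning is that the partial derivative of each differentiable output is defined with respect to each differentiable input.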

Ways to Define the Differentiability Tag

The differentiability definition of an operator consists of several aspects.

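Differentiable inputs, which can be referenced in Gradient's xs attribute.
Differentiable outputs, which can be referenced in Gradient's y attribute.
The math equation to compute the Jacobian matrix (or tensor). Whether a variable (input or output) is differentiable is decided by the math. If the Jacobian matrix (or tensor) exists, then the considered operator has some differentiable inputs and outputs.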

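There are several strategies for implementing auto-differentiation, such as forward accumulation, backward accumulation, and dual variables. Because most deep learning frameworks are backward-based, reviewers should make sure that the authors of PRs that set tags provide enough detail on the backward computation. We present a couple of methods below for verifying the differentiability of an ONNX operator.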

Method 1: Reuse Existing Deep Learning Frameworks

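The first way is to show that the considered operator's backward operation exists in an existing framework such as PyTorch or TensorFlow. In this case, the author should provide a runnable Python script that computes the backward pass of the considered operator. The author should also point out how to map the PyTorch or TensorFlow code to ONNX format (for example, by calling torch.onnx.export to save an ONNX model). The following script shows the differentiability of ONNX Reshape using PyTorch.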

import torch
import torch.nn as nn

# A single-operator model. It's literally a Pytorch Reshape.
# Note that Pytorch Reshape can be directly mapped to ONNX Reshape.
class MyModel(nn.Module):
  def __init__(self):
    super(MyModel, self).__init__()

  def forward(self, x):
    y = torch.reshape(x, (x.numel(),))
    y.retain_grad()
    return y

model = MyModel()

x = torch.tensor([[1., -1.], [1., 1.]], requires_grad=True)
y = model(x)
dy = torch.tensor([1., 2., 3., 4.])

torch.autograd.backward([y],
  grad_tensors=[dy],
  retain_graph=True,
  create_graph=True)

# This example shows the input and the output in Pytorch are differentiable.
# From the exported ONNX model below, we also see that "x" is the first input
# of ONNX Reshape and "y" the output of ONNX Reshape. Therefore, we can say
# the first input and the output of ONNX Reshape are differentiable.
print(x.grad)
print(y.grad)

with open('model.onnx', 'wb') as f:
  torch.onnx.export(model, x, f)
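
As an optional extra check, the analytical gradient that PyTorch computes can be compared against finite differences. The sketch below uses torch.autograd.gradcheck on the same reshape; the helper name reshape_to_vector and the tolerance values are illustrative choices, not part of the ONNX guide. gradcheck expects double-precision inputs.

```python
import torch
from torch.autograd import gradcheck

def reshape_to_vector(x):
    # Same computation as the single-operator model: flatten x into a 1-D tensor.
    return torch.reshape(x, (x.numel(),))

# gradcheck compares analytical and numerical Jacobians, so it needs
# double precision to keep finite-difference error small.
x = torch.tensor([[1., -1.], [1., 1.]], dtype=torch.double, requires_grad=True)

# Returns True when the two Jacobians agree within tolerance; by default it
# raises an exception otherwise.
ok = gradcheck(reshape_to_vector, (x,), eps=1e-6, atol=1e-4)
print(ok)
```

A passing check supports, but does not replace, the argument that the PyTorch operator maps directly to the ONNX operator.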

Method 2: Manually Do the Math

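The second way is to formally prove that the Jacobian matrix (or tensor) from outputs to inputs exists, and to provide at least two numerical examples. In this case, reviewers should go through the math carefully and confirm that the numerical results are correct. The author should include enough detail that any STEM graduate can review it easily.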

For example, to show the differentiability of Add, the author may first write down its equation:

C = A + B

For the sake of simplicity, assume A and B are vectors of the same shape.

A = [a1, a2]^T
B = [b1, b2]^T
C = [c1, c2]^T

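Here we use the symbol ^T to denote the transpose of the attached matrix or vector. Let X = [a1, a2, b1, b2]^T and Y = [c1, c2]^T, and consider Add as a function that maps X to Y. Then this function's Jacobian matrix is a 4-by-2 matrix,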

J = [[dc1/da1, dc2/da1],
     [dc1/da2, dc2/da2],
     [dc1/db1, dc2/db1],
     [dc1/db2, dc2/db2]]
  = [[1, 0],
     [0, 1],
     [1, 0],
     [0, 1]]

If L is a scalar that depends on C (for example, a loss) and

dL/dC = [dL/dc1, dL/dc2]^T,

then dL/dA = [dL/da1, dL/da2]^T and dL/dB = [dL/db1, dL/db2]^T can be computed from the elements of

  [[dL/da1], [dL/da2], [dL/db1], [dL/db2]]
= J * dL/dC
= [[dL/dc1], [dL/dc2], [dL/dc1], [dL/dc2]]

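where * is standard matrix multiplication. If dL/dC = [0.2, 0.8]^T, then dL/dA = [0.2, 0.8]^T and dL/dB = [0.2, 0.8]^T. Note that the procedure for computing dL/dA and dL/dB from dL/dC is usually called the backward of an operator. We can see that the backward operator of Add takes dL/dC as input and produces two outputs, dL/dA and dL/dB. Consequently, A, B, and C are all differentiable. By flattening tensors into 1-D vectors, this example covers all tensors when shape broadcasting is not needed.

If broadcasting happens, the gradient of a broadcast element is the sum of the gradients of all associated elements in the non-broadcasting case. Consider the example above again. If B = [b]^T is a 1-element vector, then B is broadcast to [b1, b2]^T with b1 = b2 = b, and dL/dB = [dL/db]^T = [dL/db1 + dL/db2]^T. For high-dimensional tensors, this is in fact a ReduceSum operation along all expanded axes.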
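
The numbers above can be reproduced with a few lines of plain Python. This is only a sketch of the hand calculation: the helper names matvec and add_backward are made up for illustration, and the second upstream gradient [1.0, -3.0] is an arbitrary extra example.

```python
# Jacobian of Add for X = [a1, a2, b1, b2]^T and Y = [c1, c2]^T (4-by-2),
# laid out as in the text: row i holds the derivatives with respect to X[i].
J = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 0.0],
     [0.0, 1.0]]

def matvec(matrix, vector):
    """Standard matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def add_backward(dL_dC):
    """Return (dL/dA, dL/dB) for two-element A and B of the same shape."""
    stacked = matvec(J, dL_dC)  # [dL/da1, dL/da2, dL/db1, dL/db2]
    return stacked[:2], stacked[2:]

# First numerical example, taken from the text.
dL_dA, dL_dB = add_backward([0.2, 0.8])
print(dL_dA, dL_dB)

# Second numerical example with an arbitrary upstream gradient.
dL_dA2, dL_dB2 = add_backward([1.0, -3.0])
print(dL_dA2, dL_dB2)

# Broadcasting case: B = [b] is expanded to [b, b], so the gradient of b is
# the sum of the gradients of the expanded elements (a ReduceSum).
dL_db = [sum(dL_dB)]
dL_db2 = [sum(dL_dB2)]
print(dL_db, dL_db2)
```

Because the Jacobian of Add contains only zeros and ones, the backward pass simply copies dL/dC to both inputs, and the broadcast case reduces those copies by summation.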