PyTorch and derivatives: A mathematical perspective
Tensor spaces as Hilbert spaces
Derivatives, Jacobians, and gradients
Technically, the type of differentiability I’ve defined here is called Gateaux differentiability, and it is a weaker condition than the more commonly used notion of Fréchet differentiability. Roughly speaking, a function is Fréchet differentiable at a point if it can be well approximated by a linear function in a neighborhood of that point, whereas Gateaux differentiability only requires the existence of directional derivatives. However, for our purposes in this post, we will implicitly assume that all our functions are Fréchet differentiable, in which case the derivative (as defined above) of a differentiable function yields a linear map \(f'(x): \mathcal{H} \to \mathcal{K}\) with \(v \mapsto f'(x)(v)\).
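For instance (to pin down the "derivative as a linear map" picture with a small example of our own), consider \(f: \mathbb{R}^2 \to \mathbb{R}^2\) defined by \(f(x_1, x_2) = (x_1 x_2, \sin x_2)\). At a point \(x = (x_1, x_2)\), the derivative is the linear map
\[ f'(x): \mathbb{R}^2 \to \mathbb{R}^2, \quad f'(x)(v_1, v_2) = (x_2 v_1 + x_1 v_2, \; (\cos x_2)\, v_2), \]
which eats a direction \(v\) and returns the corresponding directional derivative of \(f\) at \(x\).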
It is always the case in applications that fixed orthonormal bases of the Hilbert spaces \(\mathcal{H}\) and \(\mathcal{K}\) are at hand. If \(\{\epsilon_i\}\) is the fixed basis for \(\mathcal{K}\), and if \(f: \mathcal{H} \to \mathcal{K}\) is a function, then we have real-valued component functions
\[ f_i: \mathcal{H} \to \mathbb{R}, \quad f_i(x) = \langle f(x), \epsilon_i \rangle, \]
one for each basis vector \(\epsilon_i\). One may show that if \(f\) is differentiable, then so too are each of the component functions \(f_i\), and moreover, we have the fundamental link between the derivative of \(f\) and the derivatives of its component functions:
\[ f_i'(x)(v) = \langle f'(x)(v), \epsilon_i \rangle, \tag{1}\]
for all \(x, v \in \mathcal{H}\) and all \(i\). This leads us to the following central definition: the partial derivative of the component \(f_i\) at \(x\) with respect to the basis vector \(e_j\) is the number
\[ \frac{\partial f_i}{\partial e_j}(x) = f_i'(x)(e_j). \]
Invoking equation (1), we see that the partial derivative can be equivalently expressed as
\[ \frac{\partial f_i}{\partial e_j}(x) = \left\langle f'(x)(e_j), \epsilon_i \right\rangle. \]
In the case that \(f\) is real-valued, so that \(\mathcal{K} = \mathbb{R}\), we will always take the basis of \(\mathbb{R}\) to be \(\{1\}\), and so the partial derivatives of a real-valued function \(f\) at a point \(x\) with respect to the basis vectors \(\{e_j\}\) of \(\mathcal{H}\) are given by
\[ \frac{\partial f}{\partial e_j}(x) = \langle f'(x)(e_j), 1 \rangle = f'(x)(e_j). \]
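As a quick example (again of our own choosing), take \(f: \mathbb{R}^n \to \mathbb{R}\) to be the squared norm, \(f(x) = \langle x, x \rangle\). Then \(f'(x)(v) = 2\langle x, v \rangle\), and so
\[ \frac{\partial f}{\partial e_j}(x) = f'(x)(e_j) = 2\langle x, e_j \rangle = 2x_j, \]
recovering the familiar partial derivatives of \(x_1^2 + \cdots + x_n^2\).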
The importance of the partial derivatives is that they are the entries in matrix representations of derivatives: relative to the bases \(\{e_j\}\) and \(\{\epsilon_i\}\), the Jacobian matrix of \(f\) at \(x\) is the “matrix” \(Jf(x)\) whose \(ij\)-th entry is the partial derivative \(\frac{\partial f_i}{\partial e_j}(x)\).
One must take care in interpreting a Jacobian “matrix” as a matrix in the usual way, since the index sets for the bases \(\{e_j\}\) and \(\{\epsilon_i\}\) may be more complicated than just \(\{1, \ldots, n\}\) and \(\{1, \ldots, m\}\), as is the case when \(\mathcal{H}\) and \(\mathcal{K}\) are tensor spaces like \(\mathbb{R}^{m\times n}\) or \(\mathbb{R}^{m\times n \times p}\).
The sense in which the Jacobian matrix \(Jf(x)\) represents the derivative \(f'(x)\) may be described as follows. Suppose we are given a vector \(v\in \mathcal{H}\), and perform a Fourier expansion relative to the basis \(\{e_j\}\):
\[ v = \sum_j \langle v, e_j \rangle e_j = \sum_j v_j e_j, \]
where \(v_j = \langle v, e_j \rangle\). Then, for each \(j\), we also have the Fourier expansion of \(f'(x)(e_j)\) relative to the basis \(\{\epsilon_i\}\):
\[ f'(x)(e_j) = \sum_i \langle f'(x)(e_j), \epsilon_i \rangle \epsilon_i = \sum_i \frac{\partial f_i}{\partial e_j}(x) \epsilon_i. \]
Thus:
\[ f'(x)(v) = \sum_j v_j f'(x)(e_j) = \sum_i \left[ \sum_j v_j \frac{\partial f_i}{\partial e_j}(x) \right] \epsilon_i. \tag{2}\]
The reader will notice the familiar matrix multiplication pattern in the square brackets. If, in particular, the indices \(i\) and \(j\) range over \(\{1, \ldots, m\}\) and \(\{1, \ldots, n\}\), respectively, then the Jacobian matrix \(Jf(x)\) is an \(m\times n\) matrix, and the expression in the square brackets is just the \(i\)-th entry of the column vector
\[ \begin{bmatrix} \sum_j v_j \frac{\partial f_1}{\partial e_j}(x) \\ \vdots \\ \sum_j v_j \frac{\partial f_m}{\partial e_j}(x) \end{bmatrix}=\begin{bmatrix} \frac{\partial f_1}{\partial e_1}(x) & \cdots & \frac{\partial f_1}{\partial e_n}(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial e_1}(x) & \cdots & \frac{\partial f_m}{\partial e_n}(x) \end{bmatrix} \begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}. \]
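Jumping ahead to PyTorch for a moment, equation (2) is exactly what a Jacobian-vector product computes, and we can check the matrix identity above numerically. This is just a minimal sketch: the function \(f\) and the sizes below are arbitrary choices made for illustration.

```python
import torch
from torch.autograd.functional import jacobian, jvp

# An arbitrary differentiable function f: R^4 -> R^3, just for illustration.
def f(x):
    A = torch.arange(12.0).reshape(3, 4)
    return torch.tanh(A @ x)

x = torch.randn(4)
v = torch.randn(4)

J = jacobian(f, x)         # the 3 x 4 Jacobian matrix Jf(x)
_, jvp_out = jvp(f, x, v)  # f'(x)(v), computed without ever forming Jf(x)

# Equation (2): applying the derivative to v is multiplying the Jacobian by v.
print(torch.allclose(J @ v, jvp_out, atol=1e-6))  # True
```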
With derivatives and Jacobians in hand, we now move toward defining the gradient, which requires the Riesz Representation Theorem. To state it, however, we need to first introduce the notion of the dual of a (finite-dimensional) Hilbert space. This is the vector space \(\mathcal{H}^*\) of all linear functionals \(\phi: \mathcal{H} \to \mathbb{R}\), where a linear functional is a linear map from \(\mathcal{H}\) to \(\mathbb{R}\). Its vector space structure is given by pointwise addition and scalar multiplication, so that for \(\phi, \psi \in \mathcal{H}^*\) and \(a\in \mathbb{R}\), we have that
\[ (\phi + \psi)(v) = \phi(v) + \psi(v) \]
and
\[(a\phi)(v) = a\phi(v) \]
for all \(v \in \mathcal{H}\). In fact, these definitions work for any vector space, not just Hilbert spaces. But the interest in Hilbert spaces is that they have a built-in mechanism for producing linear functionals from vectors. Indeed, for each \(v \in \mathcal{H}\), we can define a linear functional
\[ v^\ast: \mathcal{H} \to \mathbb{R}, \quad v^\ast(w) = \langle v, w \rangle, \]
which is sometimes called the covector associated to \(v\). Linearity of the inner product in the second argument ensures that the covector is a linear functional, and so \(v^\ast \in \mathcal{H}^*\). Moreover, the mapping \(v \mapsto v^\ast\) is a linear map from \(\mathcal{H}\) to \(\mathcal{H}^*\), in the sense that it preserves vector addition and scalar multiplication.
The Riesz Representation Theorem states that every linear functional on a finite-dimensional Hilbert space arises as an inner product against a unique vector, called its Riesz representation: for each \(\phi \in \mathcal{H}^*\), there is a unique vector \(v_\phi \in \mathcal{H}\) such that \(\phi(w) = \langle v_\phi, w \rangle\) for all \(w \in \mathcal{H}\), i.e., such that \(\phi = v_\phi^\ast\). In other words, the map \((-)^\ast: \mathcal{H} \to \mathcal{H}^*\) sending \(v\) to \(v^\ast\) is a linear isomorphism, called the Riesz isomorphism, and its inverse sends \(\phi\) to \(v_\phi\).
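For instance, on \(\mathcal{H} = \mathbb{R}^2\) with the dot product, the linear functional \(\phi(w) = 2w_1 - 3w_2\) has Riesz representation \(v_\phi = (2, -3)\), since \(\phi(w) = \langle (2, -3), w \rangle\) for all \(w\).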
Using the Riesz isomorphism from \(\mathcal{H}\) to \(\mathcal{H}^\ast\), we can transport the inner product from \(\mathcal{H}\) to \(\mathcal{H}^*\), and thus equip \(\mathcal{H}^*\) with the structure of a Hilbert space. In particular, for \(\phi, \psi \in \mathcal{H}^*\), we can define their inner product as
\[ \langle \phi, \psi \rangle = \langle v_\phi, v_\psi \rangle. \]
When the dual space \(\mathcal{H}^\ast\) is equipped with this inner product, we call it the dual Hilbert space of \(\mathcal{H}\). The Riesz isomorphism is then an isometric isomorphism between the Hilbert space \(\mathcal{H}\) and its dual Hilbert space \(\mathcal{H}^*\).
We have the mapping \((-)^\ast: \mathcal{H} \to \mathcal{H}^*\) that acts on vectors by sending \(v\) to the covector \(v^\ast\). There is a similar operation on linear maps \(T: \mathcal{H} \to \mathcal{K}\) between finite-dimensional Hilbert spaces: the adjoint of \(T\), denoted by \(T^\ast: \mathcal{K} \to \mathcal{H}\), is defined as the unique linear map such that
\[ \langle T(v), w \rangle = \langle v, T^\ast(w) \rangle, \]
for all \(v\in \mathcal{H}\) and \(w \in \mathcal{K}\). Notice, in particular, that if we have orthonormal bases \(\{e_j\}\) and \(\{\epsilon_i\}\) for \(\mathcal{H}\) and \(\mathcal{K}\), respectively, then
\[ \langle T(e_j), \epsilon_i \rangle = \langle e_j, T^\ast(\epsilon_i) \rangle, \]
so that the \(ij\)-th entry of the “matrix” representation of \(T\) relative to these bases is equal to the \(ji\)-th entry of the “matrix” representation of \(T^\ast\) relative to the same bases. Again, we must take care in interpreting these “matrices” as matrices in the usual way—just as above when we discussed Jacobian matrices—since the index sets for the bases \(\{e_j\}\) and \(\{\epsilon_i\}\) may be more complicated than just \(\{1, \ldots, n\}\) and \(\{1, \ldots, m\}\).
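In the simplest case, if \(T: \mathbb{R}^n \to \mathbb{R}^m\) is given in the standard bases by a matrix \(A\), then \(T^\ast\) is given by the transpose \(A^\top\), since
\[ \langle Av, w \rangle = (Av)^\top w = v^\top (A^\top w) = \langle v, A^\top w \rangle \]
for all \(v \in \mathbb{R}^n\) and \(w \in \mathbb{R}^m\).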
We now obtain the gradient by smacking the derivative with the Riesz theorem: for a differentiable real-valued function \(f: \mathcal{H} \to \mathbb{R}\), the gradient of \(f\) at \(x\) is the unique vector \(\nabla f(x) \in \mathcal{H}\) such that
\[ f'(x)(v) = \langle \nabla f(x), v \rangle \]
for all \(v \in \mathcal{H}\). In other words, \(\nabla f(x)\) is the Riesz representation of the linear functional \(f'(x): \mathcal{H} \to \mathbb{R}\).
Do notice that the gradient is only defined for real-valued functions, not functions with values in a general Hilbert space.
Existence and uniqueness of the gradient follow from the general Riesz representation theorem, though in the present finite-dimensional setting, both can be demonstrated easily by direct construction. Indeed, if \(\{e_j\}\) is an orthonormal basis for \(\mathcal{H}\), and if we assume that \(\nabla f(x)\) exists with the stated property, then \(\langle \nabla f(x), e_j \rangle = f'(x)(e_j)\) for each \(e_j\), and expanding \(\nabla f(x)\) in the basis gives
\[ \nabla f(x) = \sum_j \langle \nabla f(x), e_j \rangle e_j = \sum_j f'(x)(e_j) e_j, \]
which shows that the gradient is unique if it exists. To prove that it exists, we can simply define \(\nabla f(x)\) to be the vector given by the right-hand side of the above equation, and then verify that it satisfies the required property. (By the way, this argument is essentially the standard proof of the Riesz representation theorem as stated above.) Furthermore, since \(f'(x)(e_j) = \frac{\partial f}{\partial e_j}(x)\), the Fourier expansion of the gradient relative to the basis \(\{e_j\}\) is given by
\[ \nabla f(x) = \sum_j \frac{\partial f}{\partial e_j}(x) e_j, \]
which should be familiar to anyone who has taken a course in multivariable calculus.
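This is also exactly what PyTorch hands back: for a scalar-valued function of a tensor, the gradient is the tensor of partial derivatives with respect to the standard basis vectors, and hence has the same shape as the input. A minimal sketch (the function below is an arbitrary choice):

```python
import torch

# An arbitrary real-valued function of a 2 x 3 matrix.
def f(x):
    return (x ** 3).sum()

x = torch.randn(2, 3, requires_grad=True)
f(x).backward()

# x.grad holds the partial derivatives df/de_kl, arranged in the same shape as x.
print(x.grad)
print(torch.allclose(x.grad, 3 * x ** 2))  # True: each partial is 3 * x_kl^2
```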
One of the most useful properties of the gradient is that it “points” in the direction of greatest increase of the function. This is made precise in the following result: if \(\nabla f(x) \neq 0\), then among all unit vectors \(v \in \mathcal{H}\), the value \(f'(x)(v) = \langle \nabla f(x), v \rangle\) is largest exactly when \(v = \nabla f(x) / \lVert \nabla f(x) \rVert\), in which case it equals \(\lVert \nabla f(x) \rVert\); this follows at once from the Cauchy–Schwarz inequality.
The gradient may be obtained through the adjoint action of the derivative, as stated in the following important result: for a differentiable real-valued function \(f: \mathcal{H} \to \mathbb{R}\), we have
\[ \nabla f(x) = f'(x)^\ast(1), \]
where \(1\) is our chosen basis vector of \(\mathbb{R}\). Indeed, \(\langle f'(x)^\ast(1), v \rangle = \langle 1, f'(x)(v) \rangle = f'(x)(v)\) for all \(v \in \mathcal{H}\), which is exactly the defining property of the gradient.
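In PyTorch terms, applying the adjoint \(f'(x)^\ast\) to the number \(1\) is a vector-Jacobian product with cotangent \(1\), which is what reverse-mode autodiff computes. A small sketch, with an arbitrary example function:

```python
import torch
from torch.autograd.functional import vjp

# An arbitrary real-valued function of a matrix.
def f(x):
    return (x * x).sum() / 2

x = torch.randn(3, 4)

# The adjoint of the derivative applied to 1: a vector-Jacobian product with cotangent 1.
_, grad_via_adjoint = vjp(f, x, v=torch.tensor(1.0))

print(torch.allclose(grad_via_adjoint, x))  # True: for this f, the gradient is x itself
```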
We work toward ending this long section by studying how the adjoint and derivative operations interact with function composition. Fortunately, both stories could not be simpler. For the adjoint, we have that
\[ (S\circ T)^\ast = T^\ast \circ S^\ast, \]
for linear maps \(\mathcal{H} \xrightarrow{T} \mathcal{K} \xrightarrow{S} \mathcal{L}\) on finite-dimensional Hilbert spaces. The proof uses the uniqueness of the adjoint. For any \(v \in \mathcal{H}\) and \(w \in \mathcal{L}\), we have that
\[ \langle (S\circ T)(v), w \rangle = \langle S(T(v)), w \rangle = \langle T(v), S^\ast(w) \rangle = \langle v, T^\ast(S^\ast(w)) \rangle, \]
so that \((S\circ T)^\ast(w) = T^\ast(S^\ast(w))\), and thus \((S\circ T)^\ast = T^\ast \circ S^\ast\).
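In matrix terms, this reversal is just the familiar identity \((AB)^\top = B^\top A^\top\), in light of the fact noted above that the adjoint of a matrix (in orthonormal bases) is its transpose.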
For derivatives and composition, we have the chain rule: if \(f: \mathcal{H} \to \mathcal{K}\) and \(g: \mathcal{K} \to \mathcal{L}\) are differentiable, then so is the composite \(g \circ f\), and
\[ (g\circ f)'(x) = g'(f(x)) \circ f'(x) \]
for all \(x \in \mathcal{H}\).
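As a one-variable sanity check, take \(f(x) = \sin x\) and \(g(y) = y^2\). Then \(f'(x)\) is the linear map \(v \mapsto (\cos x)\, v\) and \(g'(y)\) is \(w \mapsto 2y\, w\), so the composite \((g\circ f)'(x)\) is \(v \mapsto 2\sin(x)\cos(x)\, v\), in agreement with the usual derivative of \(\sin^2 x\).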
Enter PyTorch: Computational examples of derivatives
Wowza. A hearty slap on the back for ya if you endured the previous section and made it this far. That was all very abstract, very theoretical, and very mathematical (some of you might not view that last adjective as a compliment). But while it’s important to have a solid grasp of the theory, the theory is pretty useless if you cannot apply it in practice. In this section, I’ll show you how all of it can be beautifully illuminated by working through concrete examples using PyTorch.
Let’s consider a differentiable function \[ f: \mathbb{R}^{m\times n} \to \mathbb{R}^{p\times q}. \]
Given two vectors \(x\) and \(v\) in \(\mathbb{R}^{m\times n}\), we can express the vector \(f'(x)(v)\in \mathbb{R}^{p\times q}\) as a Fourier expansion relative to the standard orthonormal basis vectors \(\epsilon_{ij}\in \mathbb{R}^{p\times q}\) identified in the previous section:
\[ f'(x)(v) = \sum_{i,j} \left\langle f'(x)(v), \epsilon_{ij} \right\rangle \epsilon_{ij}. \tag*{$\begin{Bmatrix} 0\leq i < p \\ 0 \leq j < q \end{Bmatrix}$} \]
Similarly, we can write
\[ v = \sum_{k,l} \left\langle v, e_{kl} \right\rangle e_{kl} = \sum_{k,l} v_{kl} e_{kl}, \tag*{$\begin{Bmatrix} 0\leq k < m \\ 0 \leq l < n \end{Bmatrix}$} \]
where \(e_{kl}\) are the standard orthonormal basis vectors of \(\mathbb{R}^{m\times n}\) and \(v_{kl} = \langle v, e_{kl} \rangle\). Substituting this into the expression for \(f'(x)(v)\) and using linearity, we get that:
\[ f'(x)(v) = \sum_{k,l} v_{kl} f'(x)(e_{kl}). \tag*{$\begin{Bmatrix} 0\leq k < m \\ 0 \leq l < n \end{Bmatrix}$} \]
Therefore, we can write
\[ f'(x)(v) = \sum_{i,j} \left[ \sum_{k,l} v_{kl} \left\langle f'(x)(e_{kl}), \epsilon_{ij} \right\rangle \right]\epsilon_{ij}. \tag*{$\begin{Bmatrix} 0\leq i < p \\ 0 \leq j < q \\ 0 \leq k < m \\ 0 \leq l < n \end{Bmatrix}$} \]
But, writing \(\frac{\partial f_{ij}}{\partial x_{kl}}\) for the partial derivative of the component \(f_{ij}\) with respect to the basis vector \(e_{kl}\) (the usual coordinate notation), we have
\[ \left\langle f'(x)(e_{kl}), \epsilon_{ij} \right\rangle = \frac{\partial f_{ij}}{\partial x_{kl}}(x), \]
and so this last equation becomes
\[ f'(x)(v) = \sum_{i,j} \left[ \sum_{k,l} v_{kl} \frac{\partial f_{ij}}{\partial x_{kl}}(x) \right]\epsilon_{ij}. \tag*{$\begin{Bmatrix} 0\leq i < p \\ 0 \leq j < q \\ 0 \leq k < m \\ 0 \leq l < n \end{Bmatrix}$} \]
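Here is how the bracketed double sum can be checked numerically in PyTorch. In the sketch below, the function and the shapes (\(m=2\), \(n=3\), \(p=4\), \(q=5\)) are arbitrary choices: `jacobian` returns all the partials \(\partial f_{ij}/\partial x_{kl}\) as a tensor of shape \((p, q, m, n)\), contracting it against \(v\) with `einsum` carries out the sum over \(k, l\), and the result agrees with the Jacobian-vector product computed directly.

```python
import torch
from torch.autograd.functional import jacobian, jvp

torch.manual_seed(0)

# An arbitrary differentiable function f: R^{2x3} -> R^{4x5}.
A = torch.randn(4, 2)
B = torch.randn(3, 5)

def f(x):
    return torch.tanh(A @ x @ B)

x = torch.randn(2, 3)
v = torch.randn(2, 3)

# All partial derivatives df_ij / dx_kl, as a tensor of shape (4, 5, 2, 3).
J = jacobian(f, x)

# The bracketed double sum over k, l, carried out for every (i, j).
contracted = torch.einsum("ijkl,kl->ij", J, v)

# The same quantity f'(x)(v) computed directly as a Jacobian-vector product.
_, jvp_out = jvp(f, x, v)

print(torch.allclose(contracted, jvp_out, atol=1e-6))  # True
```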