Expressivity over Function Spaces
Having understood the fundamental theory of deep neural networks from the previous chapters, we now ask the most important question of
Why should neural networks work?
This is a difficult question to answer from a practical perspective. We can certainly point to abundant empirical evidence from real-world applications to believe that deep learning models are successfully used in a wide range of domains such as image recognition, natural language processing, and so on. However, to provide a more concrete and theoretical answer to the above question, we turn to several remarkable mathematical results concerning the function approximation capacity of neural networks. In this chapter, we review a few of these results.
A neural network can be effectively used in two major ways. One is for pattern recognition, which essentially corresponds to a classification problem as illustrated often in our previous chapters. Another important application of deep learning is to use it as a function approximation tool, in particular in regression, where the goal is to approximate an underlying functional relationship between inputs and outputs. In fact, classification problems can also be viewed as function approximation as we are effectively approximating a function that maps the input vectors to discrete output labels.
In recent years, deep learning has also been used to construct approximate solutions to differential equations by incorporating the underlying physics into the cost function through residual errors. Such networks are known as physics informed neural networks (PINN). PINNs provide a natural conceptual bridge from approximation theory to scientific machine learning.
In this chapter, we review few basic results that demonstrate the approximation capacity of neural networks. In Section «Click Here», we shall see that even a simple feedforward network with a single hidden layer can approximate, to any desired accuracy, any continuous function defined on a compact domain. This remarkable property is known as the universal approximation property.
Function Approximations
Approximating a nonlinear function is challenging because different types of nonlinearities require different function classes for effective approximation. The ultimate requirement in approximation theory is to identify a suitable function class within which we can find an approximation to any given nonlinear function to a desired accuracy. Such a property is referred to as universal approximation. Deep neural networks are known to be good universal approximators, in the sense that, by suitably choosing their architecture, one can achieve any desired accuracy for a wide class of target functions.
Let \(\mathcal{X}\) denote the input space and \(\mathcal{Y}\) the target space.
Define a function class
A hypothesis class is a subclass \(\mathcal{H}\subseteq \mathbb{F}\) from which we seek an approximation to a given target function \(f^*\in \mathbb{F}\).
In what follows, we consider the function class \(\mathbb{F}\) as a linear space equipped with a norm, thereby making it a normed linear space. In this case, we call the normed linear space a function space, especially if it is a Banach space or a Hilbert space.
Continuous Functions:
is a function space with the uniform norm (or supremum norm) defined as
where \(\|\cdot\|\) on the right hand side denotes a vector norm on \(\mathbb{R}^m\), which is often taken as the Euclidean norm. One may also take the \(\ell^1\)-norm or the maximum norm.
We use the notation \(\mathbb{C}(\mathcal{X}, \mathcal{Y})\) to denote the space of continuous functions. In particular, if \(\mathcal{Y}=\mathbb{R}\), then we use the notation \(\mathbb{C}(\mathcal{X})\).
Since \(\mathcal{X}\) is compact, \((\mathbb{C}(\mathcal{X}, \mathcal{Y}), \|\cdot\|_{\infty,\mathcal{X}})\) is a Banach space.
Continuously Differentiable Functions:
is a function space with the norm defined as
where \(\|\cdot\|\) on the right hand side is a vector norm on \(\mathbb{R}^m\), \(|\boldsymbol{\alpha}|=\alpha_1+\ldots+\alpha_n,\) and \(\mathbb{N}_0^n = (\mathbb{N}\cup \{0\})^n\) is the multi-index set.
We use the notation \(\mathbb{C}^k(\mathcal{X}, \mathcal{Y})\) to denote the space of \(k\)-times continuously differentiable functions and is a Banach space since \(\mathcal{X}\) is compact.
We use the notation \(\mathbb{C}^k(\mathcal{X})\), if \(\mathcal{Y}=\mathbb{R}\).
\(L^p\) Functions:
where \(\|\cdot\|\) denotes a vector norm on \(\mathbb{R}^m\), often taken as the Euclidean norm.
The corresponding norm on this space is defined as
For \(p = \infty\), we define
where \(\operatorname*{ess\,sup}\) denotes the essential supremum with respect to the measure \(\mu\).
The space of such functions is denoted by \(\mathbb{L}^p(\mathcal{X}, \mathcal{Y})\), and is called the \(L^p\)-space. The \(L^p\) spaces are Banach spaces, and in the special case \(p=2\), \(\mathbb{L}^2(\mathcal{X}, \mathcal{Y})\) is a Hilbert space with the inner product
where \(\langle \cdot, \cdot \rangle\) is the Euclidean inner product on \(\mathbb{R}^m\).
It is common to use the notation \(\mathbb{L}^p(\mathcal{X})\) if \( \mathcal{Y}=\mathbb{R}\).
Sobolev Functions:
The corresponding norm is defined as
and for \(p = \infty\),
The space of such functions is denoted by
and is called the Sobolev space.
The Sobolev spaces \(\mathbb{W}^{k,p}(\mathcal{X}, \mathcal{Y})\) are Banach spaces. For the special case \(p=2,\) we use the notation \(\mathbb{H}^k(\mathcal{X},\mathcal{Y}) := \mathbb{W}^{k,2}(\mathcal{X},\mathcal{Y})\), which is a Hilbert space with the inner product
Bounded Variation (BV) Functions:
The space of functions of bounded variation is denoted by
where the BV-norm is defined as
and \(|Df|(\mathcal{X})\) denotes the total variation of the derivative measure \(Df\) over \(\mathcal{X}\).
The space \(\mathbb{BV}(\mathcal{X})\) is a Banach space with respect to the norm \(\|\cdot\|_{\mathbb{BV}}\).
Often, solutions of mathematical problems are functions, and it is necessary to look for the solution in an appropriate function space. For many problems, we may know that a solution exists in a given function space, but its explicit form is either unknown or too complicated to obtain. In such situations, we look for an approximation to the solution in the function space. However, the full function space is often too large that obtaining an approximation within this space becomes impossible, and we have to restrict to a more tractable subset, a hypothesis set. To ensure that this hypothesis set is sufficient to approximate any solution to a desired accuracy, it must be dense in the function space.
A subset \(\mathcal{H}\) of a normed linear space \(\mathbb{F}\) is said to be dense in \(\mathbb{F}\) if for every \(f\in \mathbb{F}\), and every \(\varepsilon>0\), there exists \(h\in \mathcal{H}\) such that
We now introduce a notation for the set of all ANNs.
Given \(L, \hat{n}, n_0, n_L\in \mathbb{N}\), and an activation function \(\mathscr{A}:\mathbb{R}\rightarrow \mathbb{R}\), the set of all multilayer perceptrons of depth \(L\) and width \(\hat{n}\) is denoted by
Further, we use the notation
to denote the set of all multilayer perceptrons of depth \(L\) and arbitrary but finite hidden widths.
Continuous Function Approximations
In this subsection, we study the expressivity feed forward neural networks as approximations to continuous function on a compact set. Let us assume that the input set \(\mathcal{X}\subset \mathbb{R}^{n_0}\) is compact.
We call a set \(\mathcal{H}\) a universal approximator of \(\mathbb{C}(\mathbb{R}^n)\) if, for every compact set \(\mathcal{X} \subset \mathbb{R}^n\), the restriction \(\mathcal{H}|_{\mathcal{X}} = \{h|_{\mathcal{X}} : h \in \mathcal{H}\}\) is dense in \(\mathbb{C}(\mathcal{X})\) with respect to the uniform norm. In other words, for every \(f \in \mathbb{C}(\mathcal{X})\) and every \(\varepsilon > 0\), there exists \(h \in \mathcal{H}\) such that
The following are some classical examples of universal approximators.
Recall that a polynomial function \(p : \mathbb{R}^n \to \mathbb{R}\) is of the form
where only finitely many coefficients \( c_{\boldsymbol{\alpha}} \in \mathbb{R} \) are nonzero, for each multi-index \(\boldsymbol{\alpha}=(\alpha_1, \alpha_2,\ldots, \alpha_n)\in \mathbb{N}_0^n\) the monomial is defined as
with degree of the monomial being \( |\boldsymbol{\alpha}| = \alpha_1 + \alpha_2 + \cdots + \alpha_n.\) The degree of the polynomial \( p \) is the largest \( |\boldsymbol{\alpha|} \) such that \( c_\alpha \neq 0 \).
Consider the function space \(\mathbb{C}(\mathbb{R}^n)\). By Stone-Weierstrass' theorem, the space of all polynomials \(\mathcal{P}(\mathcal{X})\) on any compact set \(\mathcal{X}\subset \mathbb{R}^n\) is a dense subspace of \(\mathbb{C}(\mathcal{X})\). Hence \(\mathcal{P}(\mathbb{R}^n)\) is an universal approximator of \(\mathbb{C}(\mathbb{R}^n)\).
The following lemma gives another dense subset of the space of continuous functions. We restrict our discussion to \(n=1\) and \(m=1\) and omit the discussions on multi-dimensions.
For every \(f \in \mathbb{C}([a,b])\) and every \(\varepsilon > 0\), there exists a simple function \(s_\varepsilon: [a,b]\rightarrow \mathbb{R}\) such that
Therefore, choosing a sufficiently large \(N>0\) such that \((b-a)/N<\delta\), we can have the uniform partition
on which we have
Define a step function \(s_\varepsilon:[a,b]\to\mathbb{R}\) by
Since, for any \(x\in [a,b)\), there exists a \(k\in \{0,1,\ldots, N-1\}\) such that \(x\in [x_k, x_{k+1})\), we have
For \(x=1\), the above estimate holds obviously.
Thus, we have
We next prove the denseness of the set of all shallow Heaviside networks.
For every \(f \in \mathbb{C}([a,b])\) and every \(\varepsilon > 0\), there exists \({\mathscr{F}_{2,\hat{n}_\varepsilon}} \in \mathscr{N}_{2}(H; 1, 1)\), for some finite \(\hat{n}_\varepsilon\in \mathbb{N}\), such that
where \(\mathscr{A}=H\) is the Heaviside activation function.
for some \(\hat{n}\in \mathbb{N}\), and \(w^{(1)}_{j1}, b^{(1)}_{j}, w^{(2)}_{1j}, b^{(2)}_{1} \in \mathbb{R},\) for \(l=1,2,\), and \(j=1,2,\ldots, \hat{n}\).
Let \(f \in \mathbb{C}([a,b])\) and \(\varepsilon > 0\) be given. We have to find \(\hat{n}_\varepsilon\in \mathbb{N}\) and augmented weights \(\overline{W}^{(1)}\in \mathbb{R}^{\hat{n}_\varepsilon \times 2}\) and \(\overline{W}^{(2)}\in \mathbb{R}^{1 \times (\hat{n}_\varepsilon +1)}\) such that the corresponding ANN \({\mathscr{F}_{2,\hat{n}}} \in \mathscr{N}_{2}(H; 1, 1)\) satisfies the required inequality.
Consider the step function \(s_\varepsilon:[a,b]\to\mathbb{R}\) defined in the proof of Lemma «Click Here» . We now express this step function \(s_\varepsilon\) as a linear combination of shifted Heaviside functions. For any \(x\in [a,b)\), there exists a \(k\in \{0,1,\ldots, N-1\}\) such that \(x\in [x_k, x_{k+1})\). Thus, we have
Thus, we can write the step function \(s_\varepsilon\) as
where \(c_j \in \mathbb{R}\) are defined as
Let us take \({\mathscr{F}_{2,\hat{n}_\varepsilon}} = s_\varepsilon\), with \(\hat{n}_\varepsilon = N\), \(w^{(1)}_{j1}=1,\) \(b^{(1)}_j = x_{j-1}\), \(w^{(2)}_{1j}=c_{j-1},\) \(j=1,2,\ldots, \hat{n}_\varepsilon\), and \(b^{(2)}_1 = 0\).
Then, we see that \({\mathscr{F}_{2,\hat{n}_\varepsilon}}\in \mathscr{N}_{2}(H; 1, 1)\) and from the estimate for \(s_\varepsilon\) obtain in the proof of Lemma «Click Here» , we see that
This proves the lemma.
Next, let us extend our discussion to the denseness of continuous piecewise affine functions.
For every \(f \in \mathbb{C}([a,b])\) and every \(\varepsilon > 0\), there exists a continuous piecewise affine function \(\varphi_\varepsilon: [a,b]\rightarrow \mathbb{R}\) such that
By uniform continuity of \(f\) on \([a,b]\), there exists a \(\delta=\delta(\varepsilon)>0\) and a \(N>0\) is sufficiently large with \((b-a)/N<\delta(\varepsilon)\), such that on the uniform partition
we have
Define the piecewise affine function
for \(k=0,1,\ldots, N-1\).
Since \(\varphi_\varepsilon(x)\) lies between \(f(x_k)\) and \(f(x_{k+1})\) for any \(x\in [x_k, x_{k+1}]\), and \(f\) is continuous, by intermediate value theorem, we can find a \(\xi\in [x_k, x_{k+1}]\) such that
Now let us take \(x\in [a,b]\) arbitrarily. We can find a \(k\in \{0,1,\ldots, N-1\}\) such that \(x\in [x_k, x_{k+1}]\) and consequently, we have
for some \(\xi\in [x_k, x_{k+1}]\). Since \(x,\xi\in [x_k, x_{k+1}]\), by the way we have chosen the partition, we see that
Since the above estimate holds for any \(x\in [a,b]\), we have
We now prove that the set of shallow \(\texttt{ReLU}\) networks is a universal approximator of \(\mathbb{C}(\mathbb{R})\). As we did in the above case, we achieve this in two steps. The first step is stated in the following lemma.
For every \(f \in \mathbb{C}([a,b])\) and every \(\varepsilon > 0\), there exists \({\mathscr{F}_{2,\hat{n}_\varepsilon}} \in \mathscr{N}_{2}(\texttt{ReLU}; 1, 1)\), for some finite \(\hat{n}_\varepsilon\in \mathbb{N}\), such that
for some \(\hat{n}\in \mathbb{N}\), and \(w^{(1)}_{j1}, b^{(1)}_{j}, w^{(2)}_{1j}, b^{(2)}_{1} \in \mathbb{R},\) for \(l=1,2,\), and \(j=1,2,\ldots, \hat{n}\).
Let \(f \in \mathbb{C}([a,b])\) and \(\varepsilon > 0\) be given. We have to find \(\hat{n}_\varepsilon\in \mathbb{N}\) and augmented weights \(\overline{W}^{(1)}\in \mathbb{R}^{\hat{n}_\varepsilon \times 2}\) and \(\overline{W}^{(2)}\in \mathbb{R}^{1 \times (\hat{n}_\varepsilon +1)}\) such that the corresponding ANN \({\mathscr{F}_{2,\hat{n}}} \in \mathscr{N}_{2}(\texttt{ReLU}; 1, 1)\) satisfies the required inequality.
Choose the partition
with \((b-a)/N<\delta(\varepsilon)\) as in the proof of Lemma «Click Here» .
Let us take \(\hat{n}_\varepsilon = N\), \(w^{(1)}_{j1}=1,\) \(b^{(1)}_j = x_{j-1}\). Then, we have
The distributional (weak) derivative of \({\mathscr{F}_{2,\hat{n}_\varepsilon}}\) is a simple function (how?). Therefore, \({\mathscr{F}_{2,\hat{n}_\varepsilon}}\) is a continuous piecewise affine function. Hence, from Lemma «Click Here» , it is enough to choose the weights \(w^{(2)}_{1j}\), \(j=1,2,\ldots, \hat{n}_\varepsilon\) and the bias \(b^{(2)}_1\) such that
From the proof of Lemma «Click Here» , we define the piecewise affine function
for \(k=0,1,\ldots, N-1\).
Case 0: Since \( \varphi_\varepsilon(x_0) = f(x_0) \) and \({\mathscr{F}_{2,\hat{n}_\varepsilon}}(x_0) = -b^{(2)}_1,\) we choose \(b^{(2)}_1=-f(x_0)\).
Case 1: Since \( \varphi_\varepsilon(x_1) = f(x_1) \) and
by equating the above two expressions, and noting that the uniform partition implies \(x_{k+1}-x_k = (b-a)/\hat{n}_\varepsilon\), we have
Case 2: Similarly, \( \varphi_\varepsilon(x_2) = {\mathscr{F}_{2,\hat{n}_\varepsilon}}(x_2) \) implies
and so on.
General Case: Continuing in this way, we can get
Thus, we obtained all the weights and biases so that \({\mathscr{F}_{2,\hat{n}_\varepsilon}}(x_k)=\varphi(x_k)\), for \(k=0,1,\ldots, \hat{n}_\varepsilon.\) Since both these functions are piecewise linear, they are equal for all \(x\in [a,b]\). Thus, the required estimate follows from Lemma «Click Here» .
The class \(\mathscr{N}_{2}(\texttt{ReLU}; 1, 1)\) is a universal approximator of \(\mathbb{C}(\mathbb{R})\).
We have to show that there exists \({\mathscr{F}_{2,\hat{n}_\varepsilon}}\in\mathscr{N}_{2}(\texttt{ReLU};1,1)\) such that
Since \(\mathcal{X}\) is compact it is closed in \(\mathbb{R}\). Since \(f\) is continuous on \(\mathcal{X}\), by the Tietze extension theorem there exists a continuous function \(\widetilde f\) on some closed and bounded interval \([a,b]\) with \(\mathcal{X}\subset[a,b]\) such that \(\widetilde f|_{\mathcal{X}} = f\).
By Lemma «Click Here» , we can obtain an ANN \({\mathscr{F}_{2,\hat{n}_\varepsilon}}\in\mathscr{N}_{2}(\texttt{ReLU};1,1)\) with width \(\hat{n}_\varepsilon=N\) that satisfies the estimate
Restricting this estimate to \(\mathcal{X}\subset[a,b]\) yields
Thus \({\mathscr{F}_{2,\hat{n}_\varepsilon}}\in\mathscr{N}_2(\texttt{ReLU};1,1)\) provides the required uniform approximation on \(\mathcal{X}\). Since \(\mathcal{X}\) and \(f\) were arbitrary, \(\mathscr{N}_2(\texttt{ReLU};1,1)\) is dense in \(C(\mathcal{X})\) for every compact \(\mathcal{X}\subset\mathbb{R}\), i.e. it is a universal approximator of \(C(\mathbb{R})\).
Next, we prove the denseness of shallow sigmoid networks. We discuss this result in a general sense by introducing the sigmoidal functions in which the logistic function is a particular case.
A function \(\varphi:\mathbb{R}\to\mathbb{R}\) is said to be sigmoidal if
We now prove denseness of shallow sigmoidal networks under stronger assumptions than needed in order to make the proof simple.
Let \(\mathscr{S}:\mathbb{R}\rightarrow \mathbb{R}\) be a continuously differentiable monotonically increasing sigmoidal function, where \(\mathscr{S}'\) is bounded in \(\mathbb{R}\).
For every \(f \in \mathbb{C}([a,b])\) and every \(\varepsilon > 0\), there exists \({\mathscr{F}_{2,\hat{n}_\varepsilon}} \in \mathscr{N}_{2}(\mathscr{S}; 1, 1)\), for some finite \(\hat{n}_\varepsilon\in \mathbb{N}\), such that
By Lemma «Click Here» , we can find a simple function
such that
For a fixed \(\alpha>0\), define
where
For any fixed \(\alpha>0\), since \( \int\limits_{-\infty}^{\infty} \psi_\alpha(x) dx = 1, \) we can write
Further, for any \(\epsilon'>0\), we can write the above integral as
Let us denote the three integrals on the right hand side of the above equation as \(I_k\), \(k=1,2,3\), and take the modulus on both sides to get
It can be shown that \(\psi_\alpha \rightarrow \delta\) as \(\alpha \rightarrow 0\) in the weak sense (in the sense of distribution). Therefore, \(\psi_\alpha \rightarrow 0\) as \(\alpha \rightarrow 0\) pointwise on \((-\infty, \epsilon')\) and also on \((\epsilon', \infty)\). Further, since \(\psi_\alpha\) is integrable on \(\mathbb{R}\) and \(s_\varepsilon\) is bounded, by Dominated Convergence Theorem, we see that \(I_1\rightarrow 0\) and \(I_2\rightarrow 0\) as \(\alpha\rightarrow 0\). Therefore, we can see that
On the other hand, by choosing \(\epsilon'<(b-a)/N\), we can see that
Combining the estimates of the three integrals, we get
Thus, we have
Combining the two inequalities, we get
It is enough to show that \(s_\alpha\in \mathscr{N}_{2}(\mathscr{S}; 1, 1)\).
Up on using properties of convolution operator, we get
where the derivative in the above expression is taken in the distribution sense. Noting that, in the distributional sense
we get
Since any \({\mathscr{F}_{2,\hat{n}_\varepsilon}}\in \mathscr{N}_{2}(\mathscr{S}; 1, 1)\) is of the form
By choosing \(\hat{n}_\varepsilon = N\), and \(w^{(1)}_{j1}=\frac{1}{\alpha}\), \(b^{(1)}_{j}=\frac{x_{j-1}}{\alpha}\), \(w^{(2)}_{1j}=c_{j-1}\), and \(b^{(2)}_{1}=0\), \(j=1,2,\ldots, N\), we see that \(s_\varepsilon = {\mathscr{F}_{2,\hat{n}_\varepsilon}}\in \mathscr{N}_{2}(\mathscr{S}; 1, 1)\). This completes the proof.
We also have the universal approximation theorem for sigmoidal networks
The class \(\mathscr{N}_{2}(\mathscr{S}; 1, 1)\) is a universal approximator of \(\mathbb{C}(\mathbb{R})\).
Proof is left as an exercise.
We can extend the universal approximation property of shallow networks to multi-dimensional, i.e., to the class \(\mathscr{N}_{2}(\mathscr{S}; n_0, 1)\), for any \(n_0\ge 1\). The theorem is called the Cybenko theorem, which we state here and omit the proof for this course.
Let \(\mathscr{S}:\mathbb{R}\rightarrow \mathbb{R}\) be a continuous sigmoidal function and let \(I_{n_0}=[0,1]^{n_0}\). Then, \(\mathscr{N}_{2}(\mathscr{S}; n_0, 1)\) is dense in \(C(I_{n_0})\).
Calin, Ovidiu, Deep Learning Architectures: A Mathematical Approach, Springer, 2020.
The following theorem categorizes the activation functions that can lead to universal approximators within shallow networks.
Petersen, Philipp and Zech, Jakob, Mathematical theory of deep learning, 2025.
Feature Mapping Perspective
Having discussed the universal approximation capacity of shallow networks and noting that this expressive power also extends to deep networks, we now ask the question of What makes neural networks so expressive? An answer to this question may be to view the hidden layers as performing a nonlinear feature mapping of the input space.
Just for notational convenience, let us restrict our discussions in this subsection to the set \(\mathscr{N}_{2}(\mathscr{A}, n_0, 1)\), which can be extended to any shallow or deep networks in a similar and straightforward way.
Any function \({\mathscr{F}_{2,\hat{n}}}\in \mathscr{N}_{2}(\mathscr{A}, n_0,1)\) is of the form
Let us denote the activation from the hidden layer as
Then, we can write the network function as
This shows that the hidden layer plays the role of the feature mapping that transforms the input to a feature space which is taken as the input space for the output layer.
In traditional machine learning, a feature map is often chosen manually or implicitly through a kernel function (as in SVMs). Whereas, neural networks learn the feature map automatically from the dataset by adjusting the parameters of the hidden layer through optimization.
Connection to kernel machines
We now compare the feature-mapping viewpoint of shallow networks with the general form of a kernel machine.
Let us first recall the general form of a kernel machine.
For an input space \(\mathcal{X}\subseteq \mathbb{R}^{n_0}\), let \(\text{π}:\mathcal{X}\times\mathcal{X}\to\mathbb{R}\) be a symmetric positive definite kernel and let \(\mathbb{H}_\text{π}\) be a Hilbert space (called Reproducing Kernel Hilbert Spaces, RKHS) with feature map \(\boldsymbol{\phi}_\text{π}:\mathcal{X}\to\mathbb{H}_\text{π}\) satisfying
A linear kernel machine (for supervised learning) produces predictions of the form
where \(\{\boldsymbol{x}_i\}_{i=1}^N\) are the training inputs and the coefficients \(\{\alpha_i\}\) are obtained by minimizing a chosen cost function with penalty.
By taking
we can write the predictions of a linear kernel machine as
where \(\tilde{\alpha}_i = \alpha_i y_i,\) for \(i=1,2,\ldots, N.\)
Comparing the above form of kernel machine function with (6.8), we see that neural networks generalize the idea of kernel machines by learning both the feature representation and the output model/layer (which is a linear model in our present case).
From this viewpoint, universal approximation theorems can be reinterpreted as stating that there exists a set of learned features \(\{\phi_j\}_{j=1}^{\hat{n}}\) such that the linear space \(\mathcal{F}:=\text{span}\{\phi_j\}_{j=1}^{\hat{n}}\) is dense (with respect to the uniform norm) in \(\mathbb{C}(\mathcal{X})\) for any compact input domain \(\mathcal{X}\).
Learning dynamics and the Neural Tangent Kernel
In the previous discussion, we viewed shallow neural networks as linear models in a learned feature space. We also saw that kernel machines correspond to linear models in a fixed (possibly infinite-dimensional) feature space determined by a kernel function. We now show that, in a certain infinite-width limit, neural networks themselves naturally induce a kernel, called the Neural Tangent Kernel (NTK), which connects learning dynamics in neural networks with kernel regression.
During training, the parameters are updated by gradient descent update rule
for a chosen learning rate \(\eta > 0\). Each update changes the output function \({\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_k)\). Linearizing the network function around \(\boldsymbol{\Theta}_k\) for a very small \(\eta>0\), we get
where \(\nabla_{\boldsymbol{\Theta}} {\mathscr{F}_{L,\hat{n}}}\) denotes the Jacobian of the network output with respect to the parameters.
Using the gradient descent update rule, we get
which can be rewritten as
Consider a supervised dataset \(\{(\boldsymbol{x}_i, \boldsymbol{y}_i)\}_{i=1}^N\) and the quadratic cost function
The gradient of the cost function in the parameter space is given by
Substituting this expression in the above expression, we get
Define
which is called the Neural Tangent Kernel (NTK).
The change in the value of the neural network function at a point \(\boldsymbol{x}\) due to the update in the parameters is given (up to first order) in terms of NTK as
By taking \(t_{k}=k\eta\) as a discrete time, we can see that the above rule is the explicit Euler discretization of the system of ordinary differential equations
Since NTK depends on the gradient of \({\mathscr{F}_{L,\hat{n}}}\), it follows that \(\text{π}_{\hat{n}}\) depends implicitly on \({\mathscr{F}_{L,\hat{n}}}\) through its parameter \(\boldsymbol{\Theta}(t)\). Hence, the RHS of the above system is nonlinear in \({\mathscr{F}_{L,\hat{n}}}\), which means that the above system is a nonlinear system of ODEs.
By increasing the hidden layer width \(\hat{n}\), we can show (proof omitted for this course) that for a sufficiently wide network with the standard NTK parameterisation (weights scaled by \(1/\sqrt{\hat n}\)) and random initialization, we have
where the approximation improves as \(\hat n\to\infty\) (in probability, and uniformly for \(t\) in any fixed interval \([0,T]\)), under mild regularity assumptions on the activation. Consequently the empirical NTK converges to a deterministic, time-invariant kernel \(\text{π}_\infty\) and becomes independent of \(t\). That is,
Therefore, in the infinite-width limit the network function evolves (approximately) according to the linear system
making the training dynamics linear in function space.
Considering the regularized cost function with the \(\ell^2\)-regularization in the function space (rather than in the parameter space) given by
where \(\alpha>0\) is the penalization parameter or ridge coefficient.
Then, the above system takes the form
Assume that the sequence \(\{\boldsymbol{\Theta}_k\}\) generated by the gradient descent update rule converges to \(\boldsymbol{\Theta}^*\) as \(k\rightarrow \infty\). Consequently, as \(t\rightarrow \infty\), we obtain the steady state \(\dfrac{d{\mathscr{F}_{L,\infty}}}{dt} = 0 \), which implies
Taking
we can write
Let us use the notation
where each coordinate \(\boldsymbol{\beta}_i\) is a block column vector, and for the given input set \(\mathcal{X} = \{\boldsymbol{x}_i\}_{i=1}^N,\) let
be the \(N\times N\) symmetric block Gram matrix whose \((i,j)\)-th block is the Gram matrix \(\text{π}_\infty(\boldsymbol{x}_i, \boldsymbol{x}_j)\). With these notations, we can write
where
is the block row vector with each component being the \(n_L\times n_L\) matrix.
The above expression shows that \({\mathscr{F}_{L,\infty}}\) lies in the span of the kernel functions \(\{ \text{π}_\infty(\cdot, \mathbf{x}_i) \}_{i=1}^N\), i.e., \({\mathscr{F}_{L,\infty}}\) resembles a kernel machine.
Physics Informed Neural Networks for ODEs
Feedforward neural networks can be used to learn solutions of initial and/or boundary value problems for differential equations (ODEs and PDEs). In this section, we restrict our discussions to initial value problem for first-order ordinary differential equations (ODEs) of the form
where \(y=y(t)\) is the solution for a given (sufficiently regular) function \(f:[0,T]\times\mathbb{R}\to\mathbb{R}\), for some \(T>0\), and \(y_0\in\mathbb{R}\) is the prescribed initial value. We seek an approximation \(\widehat{y}(t)\approx y(t)\) for \(t\in[0,T].\)
We briefly discuss how a feedforward neural network (FNN) can be used to obtain an approximation \(\widehat{y}(\cdot)={\mathscr{F}_{L,\hat{n}}}(\cdot;\boldsymbol{\Theta})\).
The mapping \(t\mapsto {\mathscr{F}_{L,\hat{n}}}(t;\boldsymbol{\Theta})\) is chosen smooth enough so that time derivatives of the network output are defined via automatic differentiation.
Trial Solution
There are two ways that we can define the trial solution.
Hard-constrained Formulation:
To enforce the initial condition exactly, define the trial solution
By construction \(\widetilde{y}(0;\boldsymbol{\Theta})=y_0\) for every choice of parameters \(\boldsymbol{\Theta}\). Therefore, the neural network function \({\mathscr{F}_{L,\hat{n}}}(t;\boldsymbol{\Theta})\) is the corrective term scaled by \(t\), and \(\widetilde{y}\) is the required approximate solution of the given IVP which satisfies the initial condition exactly.
Soft-constrained Formulation:
Here, the trial solution itself is represented by the neural network function
and the initial condition is not enforced exactly rather imposed through a penalty term in the cost function. Thus, in this case, the neural network function is the required approximate solution of the given IVP which satisfies the initial condition approximately.
Residual and Training Loss
Define the pointwise residual loss function of the ODE for the trial solution as
Select a set of collocation points \(\{t_j\}_{j=0}^N \subset [0,T]\) with \(t_0=0\) and \(\Delta t_j = t_j-t_{j-1}\), \(j=1,2,\ldots, N\). The mean squared residual cost function is defined as
The collocation points may be chosen uniformly, randomly, or using adaptive strategies. The residual cost function ensures that the neural network satisfies the ODE (6.10) approximately. We also need to enforce the initial condition \(y(0)=y_0\), which can be done by adding a penalty corresponding to the initial condition as
Finally, the total cost function is given by
where \(\lambda > 0\) is a weighting parameter controlling the relative importance of satisfying the initial condition.
Differentiating \(\widetilde{y}(t;\boldsymbol{\Theta})\) in the residual loss function is equivalent to differentiating the neural network function \({\mathscr{F}_{L,\hat{n}}}(t;\boldsymbol{\Theta})\), which we need to compute as a part of the optimization method. There are two ways that we can compute \(\frac{d}{dt}{\mathscr{F}_{L,\hat{n}}}(t;\boldsymbol{\Theta})\).
- We can perform exact computation of \(\frac{d}{dt}{\mathscr{F}_{L,\hat{n}}}(t;\boldsymbol{\Theta})\) using automatic differentiation (not covered in this course) of the FNN.
- We can also approximate \(\frac{d}{dt}\widetilde{y}(t;\boldsymbol{\Theta})\) at the collocation points using backward difference operator given by
\[ \frac{d}{dt}\widetilde{y}(t;\boldsymbol{\Theta}) \approx \frac{\widetilde{y}(t_{k};\boldsymbol{\Theta}) - \widetilde{y}(t_{k-1};\boldsymbol{\Theta})}{\Delta t_i}. \]
Extending the ideas of PINNs from ODEs to PDEs is not a straight forward task. However, in certain simple PDE problems, the PINN formulation is simple.
Training Algorithm
The training goal is to find a parameter set \(\boldsymbol{\Theta}^\star\) such that
As an exact solution cannot be obtained for the above unconstrained optimization problem, we may use any of the standard gradient-based optimizers (like mini-batch GD, SGD with momentum or Adam) to obtain an approximation to \(\boldsymbol{\Theta}^\star\).
Input:
Processing:
For \(k=1,2,\ldots\), do the following steps:
Output: Take \(\boldsymbol{\Theta}^\star = \boldsymbol{\Theta}_k\), which the parameter set returned by the optimization method. Return \(\widetilde{y}(t;\boldsymbol{\Theta}^\star)\) as the approximate solution.
The code is developed using PyTorch. Go through the code carefully and understand it.
The same can be developed using TensorFlow.
The output of the code is shown below, where the PINN solution is compared with the exact solution of the given initial and boundary value problem for the linear advection equation.
Research Perspectives
Developing suitable architectures and efficient learning procedures for PINNs and their variants is an active research area. They form an important part of a growing branch of Machine Learning, known as Scientific Machine Learning (SciML). The importance of SciML lies in its wide range of applications, such as
and so on.
The development and analysis of PINNs and their variants broadly goes through the following steps:
At present, there is active research towards using the Neural Tangent Kernel (NTK) framework to analyze, and in some cases improve, the training dynamics and generalization properties of various types of neural networks, including transformers (to a certain extent), and in particular, of PINNs.
The NTK perspective is mainly used in analyzing the convergence behavior, spectral bias, and optimization dynamics of neural networks. Thus, NTK may be used to bridge the gap between theoretical analysis and practical implementation.