Expressivity over Function Spaces

Having understood the fundamental theory of deep neural networks from the previous chapters, we now ask the most important question of

Why should neural networks work?

This is a difficult question to answer from a practical perspective. We can certainly point to abundant empirical evidence from real-world applications to believe that deep learning models are successfully used in a wide range of domains such as image recognition, natural language processing, and so on. However, to provide a more concrete and theoretical answer to the above question, we turn to several remarkable mathematical results concerning the function approximation capacity of neural networks. In this chapter, we review a few of these results.

A neural network can be effectively used in two major ways. One is for pattern recognition, which essentially corresponds to a classification problem as illustrated often in our previous chapters. Another important application of deep learning is to use it as a function approximation tool, in particular in regression, where the goal is to approximate an underlying functional relationship between inputs and outputs. In fact, classification problems can also be viewed as function approximation as we are effectively approximating a function that maps the input vectors to discrete output labels.

In recent years, deep learning has also been used to construct approximate solutions to differential equations by incorporating the underlying physics into the cost function through residual errors. Such networks are known as physics informed neural networks (PINN). PINNs provide a natural conceptual bridge from approximation theory to scientific machine learning.

In this chapter, we review few basic results that demonstrate the approximation capacity of neural networks. In Section «Click Here», we shall see that even a simple feedforward network with a single hidden layer can approximate, to any desired accuracy, any continuous function defined on a compact domain. This remarkable property is known as the universal approximation property.

Function Approximations

Approximating a nonlinear function is challenging because different types of nonlinearities require different function classes for effective approximation. The ultimate requirement in approximation theory is to identify a suitable function class within which we can find an approximation to any given nonlinear function to a desired accuracy. Such a property is referred to as universal approximation. Deep neural networks are known to be good universal approximators, in the sense that, by suitably choosing their architecture, one can achieve any desired accuracy for a wide class of target functions.

Let \(\mathcal{X}\) denote the input space and \(\mathcal{Y}\) the target space.

Define a function class

\[ \mathbb{F}(\mathcal{X},\mathcal{Y}):= \big\{ f:\mathcal{X} \rightarrow \mathcal{Y}~\big|~ f ~ \text{possesses certain properties}\big\}. \]

A hypothesis class is a subclass \(\mathcal{H}\subseteq \mathbb{F}\) from which we seek an approximation to a given target function \(f^*\in \mathbb{F}\).

In what follows, we consider the function class \(\mathbb{F}\) as a linear space equipped with a norm, thereby making it a normed linear space. In this case, we call the normed linear space a function space, especially if it is a Banach space or a Hilbert space.

Example:

Continuous Functions:

Let \(\mathcal{X}\subset \mathbb{R}^{n}\) be compact and \(\mathcal{Y}\subseteq \mathbb{R}^m\). Then,

\[ \mathbb{F}(\mathcal{X},\mathcal{Y}):= \big\{ f:\mathcal{X} \rightarrow \mathcal{Y}~\big|~ f ~ \text{is continuous}\big\} \]

is a function space with the uniform norm (or supremum norm) defined as

\[ \|f\|_{\infty,\mathcal{X}} := \sup\{\|f(\boldsymbol{x})\|~|~\boldsymbol{x}\in \mathcal{X} \}, \]

where \(\|\cdot\|\) on the right hand side denotes a vector norm on \(\mathbb{R}^m\), which is often taken as the Euclidean norm. One may also take the \(\ell^1\)-norm or the maximum norm.

We use the notation \(\mathbb{C}(\mathcal{X}, \mathcal{Y})\) to denote the space of continuous functions. In particular, if \(\mathcal{Y}=\mathbb{R}\), then we use the notation \(\mathbb{C}(\mathcal{X})\).

Since \(\mathcal{X}\) is compact, \((\mathbb{C}(\mathcal{X}, \mathcal{Y}), \|\cdot\|_{\infty,\mathcal{X}})\) is a Banach space.

Continuously Differentiable Functions:

Let \(\mathcal{X}\subset \mathbb{R}^{n}\) be compact, \(\mathcal{Y}\subseteq \mathbb{R}^m\), and let \(k\in \mathbb{N}\). Then,

\[ \mathbb{F}(\mathcal{X},\mathcal{Y}):= \big\{ f:\mathcal{X} \rightarrow \mathcal{Y}~\big|~ f ~ \text{is } k\text{-times continuously differentiable}\big\} \]

is a function space with the norm defined as

\[ \|f\|_{\infty,k,\mathcal{X}} := \max_{|\boldsymbol{\alpha}|\le k}~\big(\sup\{\|\partial^{\boldsymbol{\alpha}} f(\boldsymbol{x})\|~|~\boldsymbol{x}\in \mathcal{X}, \boldsymbol{\alpha}=(\alpha_1,\ldots,\alpha_n)\in \mathbb{N}_0^n \} \big), \]

where \(\|\cdot\|\) on the right hand side is a vector norm on \(\mathbb{R}^m\), \(|\boldsymbol{\alpha}|=\alpha_1+\ldots+\alpha_n,\) and \(\mathbb{N}_0^n = (\mathbb{N}\cup \{0\})^n\) is the multi-index set.

We use the notation \(\mathbb{C}^k(\mathcal{X}, \mathcal{Y})\) to denote the space of \(k\)-times continuously differentiable functions and is a Banach space since \(\mathcal{X}\) is compact.

We use the notation \(\mathbb{C}^k(\mathcal{X})\), if \(\mathcal{Y}=\mathbb{R}\).

\(L^p\) Functions:

Let \((\mathcal{X}, \Sigma, \mu)\) be a measure space and \(\mathcal{Y} = \mathbb{R}^m\). Then,

\[ \mathbb{F}(\mathcal{X}, \mathcal{Y}) := \Big\{ f:\mathcal{X} \rightarrow \mathcal{Y}~\big|~ \int_{\mathcal{X}} \|f(\boldsymbol{x})\|^p\, d\mu(\boldsymbol{x}) < \infty \Big\}, \quad 1 \le p < \infty, \]

where \(\|\cdot\|\) denotes a vector norm on \(\mathbb{R}^m\), often taken as the Euclidean norm.

The corresponding norm on this space is defined as

\[ \|f\|_{L^p} := \Bigg( \int_{\mathcal{X}} \|f(\boldsymbol{x})\|^p\, d\mu(\boldsymbol{x}) \Bigg)^{\!1/p}. \]

For \(p = \infty\), we define

\[ \|f\|_{L^\infty} := \operatorname*{ess\,sup}_{\boldsymbol{x} \in \mathcal{X}} \|f(\boldsymbol{x})\|, \]

where \(\operatorname*{ess\,sup}\) denotes the essential supremum with respect to the measure \(\mu\).

The space of such functions is denoted by \(\mathbb{L}^p(\mathcal{X}, \mathcal{Y})\), and is called the \(L^p\)-space. The \(L^p\) spaces are Banach spaces, and in the special case \(p=2\), \(\mathbb{L}^2(\mathcal{X}, \mathcal{Y})\) is a Hilbert space with the inner product

\[ \langle f, g \rangle_{L^2} := \int_{\mathcal{X}} \langle f(\boldsymbol{x}), g(\boldsymbol{x}) \rangle\, d\mu(\boldsymbol{x}), \]

where \(\langle \cdot, \cdot \rangle\) is the Euclidean inner product on \(\mathbb{R}^m\).

It is common to use the notation \(\mathbb{L}^p(\mathcal{X})\) if \( \mathcal{Y}=\mathbb{R}\).

Sobolev Functions:

Let \(\mathcal{X}\subset \mathbb{R}^{n}\) be an open set and \(\mathcal{Y} = \mathbb{R}^m\). For \(k \in \mathbb{N}\) and \(1 \le p \le \infty\), we define the function class

\[ \mathbb{F}(\mathcal{X}, \mathcal{Y}) := \Big\{ f:\mathcal{X} \rightarrow \mathcal{Y}~\big|~ \partial^{\boldsymbol{\alpha}} f \in L^p(\mathcal{X}, \mathcal{Y}) \text{ for all multi-indices } \boldsymbol{\alpha} \text{ with } |\boldsymbol{\alpha}| \le k \Big\}. \]

The corresponding norm is defined as

\[ \|f\|_{W^{k,p}(\mathcal{X})} := \Bigg( \sum_{|\boldsymbol{\alpha}| \le k} \|\partial^{\boldsymbol{\alpha}} f\|_{L^p(\mathcal{X})}^p \Bigg)^{\!1/p}, \qquad 1 \le p < \infty, \]

and for \(p = \infty\),

\[ \|f\|_{W^{k,\infty}(\mathcal{X})} := \max_{|\boldsymbol{\alpha}| \le k} \|\partial^{\boldsymbol{\alpha}} f\|_{L^\infty(\mathcal{X})}. \]

The space of such functions is denoted by

\[ \mathbb{W}^{k,p}(\mathcal{X}, \mathcal{Y}), \]

and is called the Sobolev space.

The Sobolev spaces \(\mathbb{W}^{k,p}(\mathcal{X}, \mathcal{Y})\) are Banach spaces. For the special case \(p=2,\) we use the notation \(\mathbb{H}^k(\mathcal{X},\mathcal{Y}) := \mathbb{W}^{k,2}(\mathcal{X},\mathcal{Y})\), which is a Hilbert space with the inner product

\[ \langle f, g \rangle_{\mathbb{H}^k} := \sum_{|\boldsymbol{\alpha}| \le k} \langle \partial^{\boldsymbol{\alpha}} f, \partial^{\boldsymbol{\alpha}} g \rangle_{L^2(\mathcal{X})}. \]

Bounded Variation (BV) Functions:

Let \(\mathcal{X} \subset \mathbb{R}^n\) be an open set and \(\mathcal{Y} = \mathbb{R}\). A function \(f:\mathcal{X} \to \mathcal{Y}\) is said to be of bounded variation if \(f \in L^1(\mathcal{X})\) and its distributional derivatives are finite Radon measures.

The space of functions of bounded variation is denoted by

\[ \mathbb{BV}(\mathcal{X}) := \Big\{ f \in L^1(\mathcal{X}) \;\big|\; \|f\|_{\mathbb{BV}} < \infty \Big\}, \]

where the BV-norm is defined as

\[ \|f\|_{\mathbb{BV}} := \|f\|_{L^1(\mathcal{X})} + |Df|(\mathcal{X}), \]

and \(|Df|(\mathcal{X})\) denotes the total variation of the derivative measure \(Df\) over \(\mathcal{X}\).

The space \(\mathbb{BV}(\mathcal{X})\) is a Banach space with respect to the norm \(\|\cdot\|_{\mathbb{BV}}\).

Often, solutions of mathematical problems are functions, and it is necessary to look for the solution in an appropriate function space. For many problems, we may know that a solution exists in a given function space, but its explicit form is either unknown or too complicated to obtain. In such situations, we look for an approximation to the solution in the function space. However, the full function space is often too large that obtaining an approximation within this space becomes impossible, and we have to restrict to a more tractable subset, a hypothesis set. To ensure that this hypothesis set is sufficient to approximate any solution to a desired accuracy, it must be dense in the function space.

Definition:

[Dense Subset]

A subset \(\mathcal{H}\) of a normed linear space \(\mathbb{F}\) is said to be dense in \(\mathbb{F}\) if for every \(f\in \mathbb{F}\), and every \(\varepsilon>0\), there exists \(h\in \mathcal{H}\) such that

\[ \|f - h\| < \varepsilon. \]

Problem:

Let \(\mathcal{H}\) be a dense subset of the normed linear space \(\mathbb{F}\). Then show that for every \(f\in \mathbb{F}\), there exists a sequence \(\{h_n\}\) in \(\mathcal{H}\) such that \(h_n\rightarrow f\) as \(n\rightarrow \infty\).

We now introduce a notation for the set of all ANNs.

Remark:

[Family of Feedforward Neural Networks]

Given \(L, \hat{n}, n_0, n_L\in \mathbb{N}\), and an activation function \(\mathscr{A}:\mathbb{R}\rightarrow \mathbb{R}\), the set of all multilayer perceptrons of depth \(L\) and width \(\hat{n}\) is denoted by

\[ \mathscr{N}_{L,\hat{n}}(\mathscr{A}; n_0, n_L) = \Big\{ {\mathscr{F}_{L,\hat{n}}}: \mathbb{R}^{n_0} \rightarrow \mathbb{R}^{n_L}~\Big|~ \mathscr{A}^{(l)} = \mathscr{A}, l=1,2,\ldots, L-1, \mathscr{A}^{(L)}(x) = x \Big\}. \]

Further, we use the notation

\[ \mathscr{N}_{L}(\mathscr{A}; n_0, n_L) = \displaystyle{\bigcup_{\hat{n}\in \mathbb{N}}} \mathscr{N}_{L,\hat{n}}(\mathscr{A}; n_0, n_L) \]

to denote the set of all multilayer perceptrons of depth \(L\) and arbitrary but finite hidden widths.

Continuous Function Approximations

In this subsection, we study the expressivity feed forward neural networks as approximations to continuous function on a compact set. Let us assume that the input set \(\mathcal{X}\subset \mathbb{R}^{n_0}\) is compact.

We call a set \(\mathcal{H}\) a universal approximator of \(\mathbb{C}(\mathbb{R}^n)\) if, for every compact set \(\mathcal{X} \subset \mathbb{R}^n\), the restriction \(\mathcal{H}|_{\mathcal{X}} = \{h|_{\mathcal{X}} : h \in \mathcal{H}\}\) is dense in \(\mathbb{C}(\mathcal{X})\) with respect to the uniform norm. In other words, for every \(f \in \mathbb{C}(\mathcal{X})\) and every \(\varepsilon > 0\), there exists \(h \in \mathcal{H}\) such that

\[ \sup_{x \in \mathcal{X}} |f(x) - h(x)| < \varepsilon. \]

The following are some classical examples of universal approximators.

Example:

[Multivariate Polynomial Function]

Recall that a polynomial function \(p : \mathbb{R}^n \to \mathbb{R}\) is of the form

\[ f(\boldsymbol{x}) = \sum_{\boldsymbol{\alpha} \in \mathbb{N}_0^n} c_{\boldsymbol{\alpha}}\, \boldsymbol{x}^{\boldsymbol{\alpha}},~\boldsymbol{x}=(x_1, x_2, \ldots, x_n)\in \mathbb{R}^n, \]

where only finitely many coefficients \( c_{\boldsymbol{\alpha}} \in \mathbb{R} \) are nonzero, for each multi-index \(\boldsymbol{\alpha}=(\alpha_1, \alpha_2,\ldots, \alpha_n)\in \mathbb{N}_0^n\) the monomial is defined as

\[ \boldsymbol{x}^{\boldsymbol{\alpha}} = \prod_{k=1}^n x_k^{\alpha_k} \]

with degree of the monomial being \( |\boldsymbol{\alpha}| = \alpha_1 + \alpha_2 + \cdots + \alpha_n.\) The degree of the polynomial \( p \) is the largest \( |\boldsymbol{\alpha|} \) such that \( c_\alpha \neq 0 \).

Consider the function space \(\mathbb{C}(\mathbb{R}^n)\). By Stone-Weierstrass' theorem, the space of all polynomials \(\mathcal{P}(\mathcal{X})\) on any compact set \(\mathcal{X}\subset \mathbb{R}^n\) is a dense subspace of \(\mathbb{C}(\mathcal{X})\). Hence \(\mathcal{P}(\mathbb{R}^n)\) is an universal approximator of \(\mathbb{C}(\mathbb{R}^n)\).

The following lemma gives another dense subset of the space of continuous functions. We restrict our discussion to \(n=1\) and \(m=1\) and omit the discussions on multi-dimensions.

Lemma:

[Expressivity of Simple Functions]

For every \(f \in \mathbb{C}([a,b])\) and every \(\varepsilon > 0\), there exists a simple function \(s_\varepsilon: [a,b]\rightarrow \mathbb{R}\) such that

\[ \|f - s_\varepsilon\|_{\infty,[a,b]} < \varepsilon. \]

Proof:

Since \(f\) is continuous on the compact interval \([a,b]\), it is uniformly continuous. Hence, by definition, for every \(\varepsilon>0\), we can find a \(\delta>0\) (depends only on \(\varepsilon\)) such that, for any \(x,y\in [a,b]\) with \(|x-y|<\delta\), we have \(|f(x) - f(y)|<\varepsilon.\)

Therefore, choosing a sufficiently large \(N>0\) such that \((b-a)/N<\delta\), we can have the uniform partition

\[ a = x_0 < x_1 < \dots < x_N = b, \]

on which we have

\[ |f(x) - f(y)| < \varepsilon, \quad \text{for all } x,y \in [x_k, x_{k+1}],~k=0,\dots,N-1. \]

Define a step function \(s_\varepsilon:[a,b]\to\mathbb{R}\) by

\[ \begin{array}{cc} s_\varepsilon(x) = f(x_k), &\text{for } x \in [x_k, x_{k+1}), \quad k = 0,1,\dots,N-1,\\ s_\varepsilon(1) = f(x_{N-1}).& \end{array} \]

Since, for any \(x\in [a,b)\), there exists a \(k\in \{0,1,\ldots, N-1\}\) such that \(x\in [x_k, x_{k+1})\), we have

\begin{eqnarray} |f(x) - s_\varepsilon(x)| &=& |f(x)-f(x_k) + f(x_k)-s_\varepsilon(x)|\\ &\le& |f(x)-f(x_k)| \\ &<& \varepsilon. \end{eqnarray}

(6.1)

For \(x=1\), the above estimate holds obviously.

Thus, we have

\[ \|f - s_\varepsilon\|_{\infty,[a,b]} < \varepsilon. \]

We next prove the denseness of the set of all shallow Heaviside networks.

Lemma:

[Expressivity of Shallow Heaviside Networks]

For every \(f \in \mathbb{C}([a,b])\) and every \(\varepsilon > 0\), there exists \({\mathscr{F}_{2,\hat{n}_\varepsilon}} \in \mathscr{N}_{2}(H; 1, 1)\), for some finite \(\hat{n}_\varepsilon\in \mathbb{N}\), such that

\[ \|f - {\mathscr{F}_{2,\hat{n}_\varepsilon}}\|_{\infty,[a,b]} < \varepsilon. \]

where \(\mathscr{A}=H\) is the Heaviside activation function.

Proof:

Every \({\mathscr{F}_{2,\hat{n}}} \in \mathscr{N}_{2}(H; 1, 1)\) takes the form

\[ {\mathscr{F}_{2,\hat{n}}}(x) = \sum_{j=1}^{\hat{n}}w^{(2)}_{1j}\, H\left(w^{(1)}_{j1} x - b^{(1)}_j\right) - b^{(2)}_1, x\in \mathbb{R}, \]

for some \(\hat{n}\in \mathbb{N}\), and \(w^{(1)}_{j1}, b^{(1)}_{j}, w^{(2)}_{1j}, b^{(2)}_{1} \in \mathbb{R},\) for \(l=1,2,\), and \(j=1,2,\ldots, \hat{n}\).

Let \(f \in \mathbb{C}([a,b])\) and \(\varepsilon > 0\) be given. We have to find \(\hat{n}_\varepsilon\in \mathbb{N}\) and augmented weights \(\overline{W}^{(1)}\in \mathbb{R}^{\hat{n}_\varepsilon \times 2}\) and \(\overline{W}^{(2)}\in \mathbb{R}^{1 \times (\hat{n}_\varepsilon +1)}\) such that the corresponding ANN \({\mathscr{F}_{2,\hat{n}}} \in \mathscr{N}_{2}(H; 1, 1)\) satisfies the required inequality.

Consider the step function \(s_\varepsilon:[a,b]\to\mathbb{R}\) defined in the proof of Lemma «Click Here» . We now express this step function \(s_\varepsilon\) as a linear combination of shifted Heaviside functions. For any \(x\in [a,b)\), there exists a \(k\in \{0,1,\ldots, N-1\}\) such that \(x\in [x_k, x_{k+1})\). Thus, we have

\[ H(x-x_j) = \left\{\begin{array}{cc} 1,&\text{if } j\le k\\ 0,&\text{if } j> k, \end{array}\right. ~\text{for}~j=0,1,\ldots, N. \]

Thus, we can write the step function \(s_\varepsilon\) as

\[ s_\varepsilon(x) = \sum_{j=0}^{N-1} c_j H(x - x_j), ~x\in [a,b), \]

where \(c_j \in \mathbb{R}\) are defined as

\[ c_0=f(x_0), ~c_j=f(x_j) - \sum_{i=0}^{j-1}c_i,~j=1,2,\ldots, N-1. \]

Let us take \({\mathscr{F}_{2,\hat{n}_\varepsilon}} = s_\varepsilon\), with \(\hat{n}_\varepsilon = N\), \(w^{(1)}_{j1}=1,\) \(b^{(1)}_j = x_{j-1}\), \(w^{(2)}_{1j}=c_{j-1},\) \(j=1,2,\ldots, \hat{n}_\varepsilon\), and \(b^{(2)}_1 = 0\).

Then, we see that \({\mathscr{F}_{2,\hat{n}_\varepsilon}}\in \mathscr{N}_{2}(H; 1, 1)\) and from the estimate for \(s_\varepsilon\) obtain in the proof of Lemma «Click Here» , we see that

\[ \|f - {\mathscr{F}_{2,\hat{n}_\varepsilon}}\|_{\infty, [a,b]} < \varepsilon. \]

This proves the lemma.

Next, let us extend our discussion to the denseness of continuous piecewise affine functions.

Lemma:

[Denseness of Continuous Piecewise Affine Functions]

For every \(f \in \mathbb{C}([a,b])\) and every \(\varepsilon > 0\), there exists a continuous piecewise affine function \(\varphi_\varepsilon: [a,b]\rightarrow \mathbb{R}\) such that

\[ \|f - \varphi_\varepsilon\|_{\infty,[a,b]} < \varepsilon. \]

Proof:

Let \(f \in \mathbb{C}([a,b])\) and \(\varepsilon>0\) be given.

By uniform continuity of \(f\) on \([a,b]\), there exists a \(\delta=\delta(\varepsilon)>0\) and a \(N>0\) is sufficiently large with \((b-a)/N<\delta(\varepsilon)\), such that on the uniform partition

\[ a = x_0 < x_1 < \dots < x_N = b, \]

we have

\[ |f(x) - f(y)| < \varepsilon, \quad \text{for all } x,y \in [x_k, x_{k+1}],~k=0,\dots,N-1. \]

Define the piecewise affine function

\[ \varphi_\varepsilon(x) = f(x_k) + t\big(f(x_{k+1}) - f(x_{k})\big),~ \text{for}~ x=x_k+t (x_{k+1}-x_k),~\text{with}~t\in [0,1], \]

for \(k=0,1,\ldots, N-1\).

Since \(\varphi_\varepsilon(x)\) lies between \(f(x_k)\) and \(f(x_{k+1})\) for any \(x\in [x_k, x_{k+1}]\), and \(f\) is continuous, by intermediate value theorem, we can find a \(\xi\in [x_k, x_{k+1}]\) such that

\[ \varphi_\varepsilon(x) = f(\xi). \]

Now let us take \(x\in [a,b]\) arbitrarily. We can find a \(k\in \{0,1,\ldots, N-1\}\) such that \(x\in [x_k, x_{k+1}]\) and consequently, we have

\[ |f(x) - \varphi_\varepsilon(x)| = |f(x) - f(\xi)|, \]

for some \(\xi\in [x_k, x_{k+1}]\). Since \(x,\xi\in [x_k, x_{k+1}]\), by the way we have chosen the partition, we see that

\[ |f(x) - \varphi_\varepsilon(x)|= |f(x) - f(\xi)| < \varepsilon. \]

Since the above estimate holds for any \(x\in [a,b]\), we have

\[ \|f-\varphi_\varepsilon\|_{\infty, [a,b]} = \sup_{x\in [a,b]} |f(x) - \varphi_\varepsilon(x)| < \varepsilon. \]

We now prove that the set of shallow \(\texttt{ReLU}\) networks is a universal approximator of \(\mathbb{C}(\mathbb{R})\). As we did in the above case, we achieve this in two steps. The first step is stated in the following lemma.

Lemma:

[Denseness of Shallow ReLU Networks]

For every \(f \in \mathbb{C}([a,b])\) and every \(\varepsilon > 0\), there exists \({\mathscr{F}_{2,\hat{n}_\varepsilon}} \in \mathscr{N}_{2}(\texttt{ReLU}; 1, 1)\), for some finite \(\hat{n}_\varepsilon\in \mathbb{N}\), such that

\[ \|f - {\mathscr{F}_{2,\hat{n}_\varepsilon}}\|_{\infty,[a,b]} < \varepsilon. \]

Proof:

Every \({\mathscr{F}_{2,\hat{n}}} \in \mathscr{N}_{2}(\texttt{ReLU}; 1, 1)\) takes the form

\[ {\mathscr{F}_{2,\hat{n}}}(x) = \sum_{j=1}^{\hat{n}}w^{(2)}_{1j}\, \texttt{ReLU}\left(w^{(1)}_{j1} x - b^{(1)}_j\right) - b^{(2)}_1, x\in \mathbb{R}, \]

for some \(\hat{n}\in \mathbb{N}\), and \(w^{(1)}_{j1}, b^{(1)}_{j}, w^{(2)}_{1j}, b^{(2)}_{1} \in \mathbb{R},\) for \(l=1,2,\), and \(j=1,2,\ldots, \hat{n}\).

Choose the partition

\[ a = x_0 < x_1 < \dots < x_N = b, \]

with \((b-a)/N<\delta(\varepsilon)\) as in the proof of Lemma «Click Here» .

Let us take \(\hat{n}_\varepsilon = N\), \(w^{(1)}_{j1}=1,\) \(b^{(1)}_j = x_{j-1}\). Then, we have

\[ {\mathscr{F}_{2,\hat{n}_\varepsilon}}(x) = \sum_{j=1}^{\hat{n}_\varepsilon}w^{(2)}_{1j}\, \texttt{ReLU}\left(x - x_{j-1}\right) - b^{(2)}_1, x\in \mathbb{R}, \]

The distributional (weak) derivative of \({\mathscr{F}_{2,\hat{n}_\varepsilon}}\) is a simple function (how?). Therefore, \({\mathscr{F}_{2,\hat{n}_\varepsilon}}\) is a continuous piecewise affine function. Hence, from Lemma «Click Here» , it is enough to choose the weights \(w^{(2)}_{1j}\), \(j=1,2,\ldots, \hat{n}_\varepsilon\) and the bias \(b^{(2)}_1\) such that

\[ {\mathscr{F}_{2,\hat{n}_\varepsilon}}(x) = \varphi_\varepsilon(x),~ \text{for all}~ x\in [a,b]. \]

From the proof of Lemma «Click Here» , we define the piecewise affine function

\[ \varphi_\varepsilon(x) = f(x_k) + \left(\frac{f(x_{k+1}) - f(x_{k})}{x_{k+1}-x_k}\right)(x-x_k). \]

for \(k=0,1,\ldots, N-1\).

Case 0: Since \( \varphi_\varepsilon(x_0) = f(x_0) \) and \({\mathscr{F}_{2,\hat{n}_\varepsilon}}(x_0) = -b^{(2)}_1,\) we choose \(b^{(2)}_1=-f(x_0)\).

Case 1: Since \( \varphi_\varepsilon(x_1) = f(x_1) \) and

\[ {\mathscr{F}_{2,\hat{n}_\varepsilon}}(x_1) = w^{(2)}_{11}\, \texttt{ReLU}\left(x_1 - x_0\right) + f(x_0) = w^{(2)}_{11}\, (x_1 - x_0) + f(x_0), \]

by equating the above two expressions, and noting that the uniform partition implies \(x_{k+1}-x_k = (b-a)/\hat{n}_\varepsilon\), we have

\[ w^{(2)}_{11} = \frac{\hat{n}_\varepsilon}{b-a} \big(f(x_{1}) - f(x_{0})\big). \]

Case 2: Similarly, \( \varphi_\varepsilon(x_2) = {\mathscr{F}_{2,\hat{n}_\varepsilon}}(x_2) \) implies

\[ w^{(2)}_{12} = \frac{\hat{n}_\varepsilon\big(f(x_{2}) - 2f(x_{1})+f(x_0)\big)}{b-a}. \]

and so on.

General Case: Continuing in this way, we can get

\[ w^{(2)}_{1j} = \frac{\hat{n}_\varepsilon}{b-a} \big(f(x_{j}) - 2f(x_{j-1})+f(x_{j-2})\big), ~~j=2,3,\ldots,\hat{n}_\varepsilon. \]

Thus, we obtained all the weights and biases so that \({\mathscr{F}_{2,\hat{n}_\varepsilon}}(x_k)=\varphi(x_k)\), for \(k=0,1,\ldots, \hat{n}_\varepsilon.\) Since both these functions are piecewise linear, they are equal for all \(x\in [a,b]\). Thus, the required estimate follows from Lemma «Click Here» .

Theorem:

[Universal Approximation by Shallow ReLU Networks]

The class \(\mathscr{N}_{2}(\texttt{ReLU}; 1, 1)\) is a universal approximator of \(\mathbb{C}(\mathbb{R})\).

Proof:

Let \(\mathcal{X}\subset\mathbb{R}\) be compact and let \(f\in\mathbb{C}(\mathcal{X})\) and \(\varepsilon>0\) be given.

We have to show that there exists \({\mathscr{F}_{2,\hat{n}_\varepsilon}}\in\mathscr{N}_{2}(\texttt{ReLU};1,1)\) such that

\[ \|f-{\mathscr{F}_{2,\hat{n}_\varepsilon}}\|_{\infty,\mathcal{X}}<\varepsilon. \]

Since \(\mathcal{X}\) is compact it is closed in \(\mathbb{R}\). Since \(f\) is continuous on \(\mathcal{X}\), by the Tietze extension theorem there exists a continuous function \(\widetilde f\) on some closed and bounded interval \([a,b]\) with \(\mathcal{X}\subset[a,b]\) such that \(\widetilde f|_{\mathcal{X}} = f\).

By Lemma «Click Here» , we can obtain an ANN \({\mathscr{F}_{2,\hat{n}_\varepsilon}}\in\mathscr{N}_{2}(\texttt{ReLU};1,1)\) with width \(\hat{n}_\varepsilon=N\) that satisfies the estimate

\[ \|\widetilde{f}-{\mathscr{F}_{2,\hat{n}_\varepsilon}}\|_{\infty,[a,b]}<\varepsilon. \]

Restricting this estimate to \(\mathcal{X}\subset[a,b]\) yields

\begin{eqnarray} \|f-{\mathscr{F}_{2,\hat{n}_\varepsilon}}\|_{\infty,\mathcal{X}} &=& \sup_{x\in\mathcal{X}} |\widetilde f(x)-{\mathscr{F}_{2,\hat{n}_\varepsilon}}(x)|\\ &\le& \sup_{x\in[a,b]} |\widetilde f(x)-{\mathscr{F}_{2,\hat{n}_\varepsilon}}(x)| \\ &=& \|\widetilde{f}-{\mathscr{F}_{2,\hat{n}_\varepsilon}}\|_{\infty,[a,b]}< \varepsilon. \end{eqnarray}

(6.2)

Thus \({\mathscr{F}_{2,\hat{n}_\varepsilon}}\in\mathscr{N}_2(\texttt{ReLU};1,1)\) provides the required uniform approximation on \(\mathcal{X}\). Since \(\mathcal{X}\) and \(f\) were arbitrary, \(\mathscr{N}_2(\texttt{ReLU};1,1)\) is dense in \(C(\mathcal{X})\) for every compact \(\mathcal{X}\subset\mathbb{R}\), i.e. it is a universal approximator of \(C(\mathbb{R})\).

Next, we prove the denseness of shallow sigmoid networks. We discuss this result in a general sense by introducing the sigmoidal functions in which the logistic function is a particular case.

Definition:

[Sigmoidal Function]

A function \(\varphi:\mathbb{R}\to\mathbb{R}\) is said to be sigmoidal if

\begin{eqnarray} \lim_{t\to -\infty}\varphi(t) = 0, \qquad \lim_{t\to +\infty}\varphi(t) = 1. \end{eqnarray}

(6.3)

Note:

The logistic functions \(\mathscr{S}_k\) are sigmoidal functions for any \(k>0\).

We now prove denseness of shallow sigmoidal networks under stronger assumptions than needed in order to make the proof simple.

Lemma:

[Denseness of Shallow Sigmoid Networks]

Let \(\mathscr{S}:\mathbb{R}\rightarrow \mathbb{R}\) be a continuously differentiable monotonically increasing sigmoidal function, where \(\mathscr{S}'\) is bounded in \(\mathbb{R}\).

For every \(f \in \mathbb{C}([a,b])\) and every \(\varepsilon > 0\), there exists \({\mathscr{F}_{2,\hat{n}_\varepsilon}} \in \mathscr{N}_{2}(\mathscr{S}; 1, 1)\), for some finite \(\hat{n}_\varepsilon\in \mathbb{N}\), such that

\[ \|f - {\mathscr{F}_{2,\hat{n}_\varepsilon}}\|_{\infty,[a,b]} < \varepsilon. \]

Proof:

Given \(f \in \mathbb{C}([a,b])\) and \(\varepsilon>0\).

By Lemma «Click Here» , we can find a simple function

\[ s_\varepsilon(x) = \sum_{i=0}^{N-1} c_i H(x - x_i), ~\text{with}~ |c_i|\le \frac{\varepsilon}{4} ~\text{(why?)} \]

such that

\[ \|f - s_\varepsilon\|_{\infty,[a,b]} < \frac{\varepsilon}{2}. \]

For a fixed \(\alpha>0\), define

\[ s_\alpha = (s_\varepsilon \ast \psi_\alpha) (x), \]

where

\[ \psi_\alpha(x) = \frac{1}{\alpha}\mathscr{S}'\!\left(\frac{x}{\alpha}\right) =: \mathscr{S}'_\alpha(x), \qquad \alpha > 0. \]

For any fixed \(\alpha>0\), since \( \int\limits_{-\infty}^{\infty} \psi_\alpha(x) dx = 1, \) we can write

\begin{eqnarray} s_\epsilon(x) - s_\alpha(x) &=& \int\limits_{-\infty}^{\infty} \psi_\alpha(y)\big(s_\epsilon(x) - s_\epsilon(x-y) \big) dy. \end{eqnarray}

(6.4)

Further, for any \(\epsilon'>0\), we can write the above integral as

\[ s_\epsilon(x) - s_\alpha(x) = \left( \int\limits_{-\infty}^{-\epsilon'} + \int\limits_{-\epsilon'}^{\epsilon'} + \int\limits_{\epsilon'}^{\infty}\right) \psi_\alpha(y)\big(s_\epsilon(x) - s_\epsilon(x-y) \big) dy \]

Let us denote the three integrals on the right hand side of the above equation as \(I_k\), \(k=1,2,3\), and take the modulus on both sides to get

\[ |s_\epsilon(x) - s_\alpha(x)| \le |I_1| + |I_2| + |I_3|. \]

It can be shown that \(\psi_\alpha \rightarrow \delta\) as \(\alpha \rightarrow 0\) in the weak sense (in the sense of distribution). Therefore, \(\psi_\alpha \rightarrow 0\) as \(\alpha \rightarrow 0\) pointwise on \((-\infty, \epsilon')\) and also on \((\epsilon', \infty)\). Further, since \(\psi_\alpha\) is integrable on \(\mathbb{R}\) and \(s_\varepsilon\) is bounded, by Dominated Convergence Theorem, we see that \(I_1\rightarrow 0\) and \(I_2\rightarrow 0\) as \(\alpha\rightarrow 0\). Therefore, we can see that

\[ |I_1| + |I_3| < \frac{\varepsilon}{4}, ~ \text{for sufficiently small}~\alpha. \]

On the other hand, by choosing \(\epsilon'<(b-a)/N\), we can see that

\begin{eqnarray} |I_2| &\le& \int\limits_{-\epsilon'}^{\epsilon'} \psi_\alpha(y)\big|s_\epsilon(x) - s_\epsilon(x-y) \big| dy \\ &\le& |c_k|\int\limits_{-\epsilon'}^{\epsilon'} \psi_\alpha(y) dy,~~\text{for some }~k\in \{1,2,\ldots, N\}\\ &\le& \frac{\varepsilon}{4}\int\limits_{-\infty}^{\infty} \psi_\alpha(y) dy\\ &=&\frac{\varepsilon}{4}. \end{eqnarray}

(6.5)

Combining the estimates of the three integrals, we get

\[ |s_\epsilon(x) - s_\alpha(x)| \le \frac{\varepsilon}{2},~\text{for all}~ x\in [a,b]. \]

Thus, we have

\[ \|s_\epsilon - s_\alpha\|_{\infty,[a,b]} < \frac{\varepsilon}{2}. \]

Combining the two inequalities, we get

\[ \|f - s_\alpha\|_{\infty,[a,b]} < \varepsilon. \]

It is enough to show that \(s_\alpha\in \mathscr{N}_{2}(\mathscr{S}; 1, 1)\).

Up on using properties of convolution operator, we get

\begin{eqnarray} s_\alpha(x) &=& (s_\varepsilon * \psi_\alpha)(x)\\ &=& \left(s_\varepsilon *\mathscr{S}_{\alpha}'\right)(x)\\ &=& \left(s_\varepsilon' *\mathscr{S}_{\alpha}\right)(x)\\ &=& \sum_{i=0}^{N-1} c_i \frac{d}{dx}H(x - x_i)*\mathscr{S}_{\alpha}(x), \end{eqnarray}

(6.6)

where the derivative in the above expression is taken in the distribution sense. Noting that, in the distributional sense

\[ \frac{d}{dx}H(x - x_i) = \delta(x-x_i) = \delta_{x_i}(x), \]

we get

\begin{eqnarray} s_\alpha(x) &=& \left(\sum_{i=0}^{N-1} c_i \delta_{x_i}*\mathscr{S}_{\alpha}\right)(x)\\ &=& \sum_{i=0}^{N-1} c_i \left(\delta_{x_i}*\mathscr{S}_{\alpha}\right)(x)\\ &=& \sum_{i=0}^{N-1} c_i \mathscr{S}_{\alpha}(x-x_\alpha)\\ &=& \sum_{i=0}^{N-1} c_i\mathscr{S}\left(\frac{x-x_i}{\alpha}\right). \end{eqnarray}

(6.7)

Since any \({\mathscr{F}_{2,\hat{n}_\varepsilon}}\in \mathscr{N}_{2}(\mathscr{S}; 1, 1)\) is of the form

\[ {\mathscr{F}_{2,\hat{n}_\varepsilon}}(x) = \sum_{j=1}^{\hat{n}}w^{(2)}_{1j}\, \mathscr{S}\left(w^{(1)}_{j1} x - b^{(1)}_j\right) - b^{(2)}_1, x\in \mathbb{R}, \]

By choosing \(\hat{n}_\varepsilon = N\), and \(w^{(1)}_{j1}=\frac{1}{\alpha}\), \(b^{(1)}_{j}=\frac{x_{j-1}}{\alpha}\), \(w^{(2)}_{1j}=c_{j-1}\), and \(b^{(2)}_{1}=0\), \(j=1,2,\ldots, N\), we see that \(s_\varepsilon = {\mathscr{F}_{2,\hat{n}_\varepsilon}}\in \mathscr{N}_{2}(\mathscr{S}; 1, 1)\). This completes the proof.

We also have the universal approximation theorem for sigmoidal networks

Theorem:

[Universal Approximation by Shallow Sigmoidal Networks]

The class \(\mathscr{N}_{2}(\mathscr{S}; 1, 1)\) is a universal approximator of \(\mathbb{C}(\mathbb{R})\).

Proof is left as an exercise.

We can extend the universal approximation property of shallow networks to multi-dimensional, i.e., to the class \(\mathscr{N}_{2}(\mathscr{S}; n_0, 1)\), for any \(n_0\ge 1\). The theorem is called the Cybenko theorem, which we state here and omit the proof for this course.

Theorem:

[Cybenko Theorem]

Let \(\mathscr{S}:\mathbb{R}\rightarrow \mathbb{R}\) be a continuous sigmoidal function and let \(I_{n_0}=[0,1]^{n_0}\). Then, \(\mathscr{N}_{2}(\mathscr{S}; n_0, 1)\) is dense in \(C(I_{n_0})\).

Note:

Proof of the above theorem is omitted for this course. Interested students can refer to Theorem 9.3.6 (page 259) in the following book:

Calin, Ovidiu, Deep Learning Architectures: A Mathematical Approach, Springer, 2020.

Click here to see the details of the book

The following theorem categorizes the activation functions that can lead to universal approximators within shallow networks.

Theorem:

The class of shallow networks \(\mathscr{N}_{2}(\mathscr{A}; n_0, 1)\) is a universal approximator of \(\mathbb{C}(\mathbb{R}^{n_0})\) if and only if \(\mathscr{A}\) is not a polynomial.

Note:

Proof of the above theorem is omitted for this course. Interested students can refer to Theorem 3.8 (page 27) in the following Lecture Notes:

Petersen, Philipp and Zech, Jakob, Mathematical theory of deep learning, 2025.

arXiv:2407.18384 [cs.LG]

Note:

From the above results, we see that depth is not strictly necessary for universal approximation. However, deeper architectures can achieve comparable approximation accuracy with exponentially fewer neurons in certain cases. Hence, depth contributes to efficiency rather than expressivity. Universal approximation results have also been established for deep networks, often with milder conditions on the activation function.

Feature Mapping Perspective

Having discussed the universal approximation capacity of shallow networks and noting that this expressive power also extends to deep networks, we now ask the question of What makes neural networks so expressive? An answer to this question may be to view the hidden layers as performing a nonlinear feature mapping of the input space.

Just for notational convenience, let us restrict our discussions in this subsection to the set \(\mathscr{N}_{2}(\mathscr{A}, n_0, 1)\), which can be extended to any shallow or deep networks in a similar and straightforward way.

Any function \({\mathscr{F}_{2,\hat{n}}}\in \mathscr{N}_{2}(\mathscr{A}, n_0,1)\) is of the form

\[ {\mathscr{F}_{2,\hat{n}}}(\boldsymbol{x}) = \sum_{j=1}^{\hat{n}}w^{(2)}_{1j}\, \mathscr{A}\left(\boldsymbol{w}^{(1)}_{j}\cdot \boldsymbol{x} - b^{(1)}_j\right) - b^{(2)}_1, \boldsymbol{x}\in \mathbb{R}^{n_0}. \]

Let us denote the activation from the hidden layer as

\[ \boldsymbol{\phi}(x) = \left( \mathscr{A}\left(\boldsymbol{w}^{(1)}_{1}\cdot \boldsymbol{x} - b^{(1)}_1\right), \mathscr{A}\left(\boldsymbol{w}^{(1)}_{2}\cdot \boldsymbol{x} - b^{(1)}_2\right), \cdots, \mathscr{A}\left(\boldsymbol{w}^{(1)}_{\hat{n}}\cdot \boldsymbol{x} - b^{(1)}_{\hat{n}}\right) \right). \]

Then, we can write the network function as

\begin{eqnarray} {\mathscr{F}_{2,\hat{n}}}(x) = \boldsymbol{w}^{(2)}\cdot \boldsymbol{\phi}(x) - b^{(2)}_1,~\boldsymbol{x}\in \mathbb{R}^{n_0}. \end{eqnarray}

(6.8)

This shows that the hidden layer plays the role of the feature mapping that transforms the input to a feature space which is taken as the input space for the output layer.

In traditional machine learning, a feature map is often chosen manually or implicitly through a kernel function (as in SVMs). Whereas, neural networks learn the feature map automatically from the dataset by adjusting the parameters of the hidden layer through optimization.

Connection to kernel machines

We now compare the feature-mapping viewpoint of shallow networks with the general form of a kernel machine.

Let us first recall the general form of a kernel machine.

For an input space \(\mathcal{X}\subseteq \mathbb{R}^{n_0}\), let \(\text{𝕜}:\mathcal{X}\times\mathcal{X}\to\mathbb{R}\) be a symmetric positive definite kernel and let \(\mathbb{H}_\text{𝕜}\) be a Hilbert space (called Reproducing Kernel Hilbert Spaces, RKHS) with feature map \(\boldsymbol{\phi}_\text{𝕜}:\mathcal{X}\to\mathbb{H}_\text{𝕜}\) satisfying

\[ \text{𝕜}(\boldsymbol{x}_1,\boldsymbol{x}_2)=\Big\langle\boldsymbol{\phi}_\text{𝕜}(\boldsymbol{x}_1),\boldsymbol{\phi}_\text{𝕜}(\boldsymbol{x}_2)\Big\rangle_{\mathbb{H}_\text{𝕜}}. \]

A linear kernel machine (for supervised learning) produces predictions of the form

\begin{eqnarray} \mathscr{F}(\boldsymbol{x}) &=& \sum_{i=1}^N \alpha_i\,\text{𝕜}(\boldsymbol{x}_i,\boldsymbol{x}) + b\\ &=& \Big\langle \sum_{i=1}^N \alpha_i \boldsymbol{\phi}_\text{𝕜}(\boldsymbol{x}_i),\, \boldsymbol{\phi}_\text{𝕜}(\boldsymbol{x})\Big\rangle_{\mathbb{H}_\text{𝕜}} + b, \end{eqnarray}

(6.9)

where \(\{\boldsymbol{x}_i\}_{i=1}^N\) are the training inputs and the coefficients \(\{\alpha_i\}\) are obtained by minimizing a chosen cost function with penalty.

By taking

\[ \boldsymbol{w} = \sum_{i=1}^N \alpha_i \boldsymbol{\phi}_\text{𝕜}(\boldsymbol{x}_i) \]

we can write the predictions of a linear kernel machine as

\[ \mathscr{F}(\boldsymbol{x}) = \Big\langle \boldsymbol{w},\, \boldsymbol{\phi}_\text{𝕜}(\boldsymbol{x})\Big\rangle_{\mathbb{H}_\text{𝕜}} + b. \]

Note:

Recall, SVM solve a hinge-loss quadratic programming problem whose dual solution yields the representation

\[ f(x)=\sum_i^N \tilde{\alpha}_i K(x_i,x)+b, \]

where \(\tilde{\alpha}_i = \alpha_i y_i,\) for \(i=1,2,\ldots, N.\)

Comparing the above form of kernel machine function with (6.8), we see that neural networks generalize the idea of kernel machines by learning both the feature representation and the output model/layer (which is a linear model in our present case).

From this viewpoint, universal approximation theorems can be reinterpreted as stating that there exists a set of learned features \(\{\phi_j\}_{j=1}^{\hat{n}}\) such that the linear space \(\mathcal{F}:=\text{span}\{\phi_j\}_{j=1}^{\hat{n}}\) is dense (with respect to the uniform norm) in \(\mathbb{C}(\mathcal{X})\) for any compact input domain \(\mathcal{X}\).

Learning dynamics and the Neural Tangent Kernel

In the previous discussion, we viewed shallow neural networks as linear models in a learned feature space. We also saw that kernel machines correspond to linear models in a fixed (possibly infinite-dimensional) feature space determined by a kernel function. We now show that, in a certain infinite-width limit, neural networks themselves naturally induce a kernel, called the Neural Tangent Kernel (NTK), which connects learning dynamics in neural networks with kernel regression.

During training, the parameters are updated by gradient descent update rule

\[ \boldsymbol{\Theta}_{k+1} = \boldsymbol{\Theta}_k - \eta \nabla_{\boldsymbol{\Theta}} \mathcal{C}(\boldsymbol{\Theta}_k), \]

for a chosen learning rate \(\eta > 0\). Each update changes the output function \({\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_k)\). Linearizing the network function around \(\boldsymbol{\Theta}_k\) for a very small \(\eta>0\), we get

\[ {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_{k+1}) \approx {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_k) + \big(\nabla_{\boldsymbol{\Theta}} {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_k)\big) (\boldsymbol{\Theta}_{k+1} - \boldsymbol{\Theta}_k), \]

where \(\nabla_{\boldsymbol{\Theta}} {\mathscr{F}_{L,\hat{n}}}\) denotes the Jacobian of the network output with respect to the parameters.

Note:

If \({\mathscr{F}_{L,\hat{n}}}\) is a real-valued function (i.e., \(n_L = 1\)), then \(\nabla_{\boldsymbol{\Theta}} {\mathscr{F}_{L,\hat{n}}}\) is a row vector. To be consistent with the convention that vectors are represented as column vectors, it is customary to write \((\nabla_{\boldsymbol{\Theta}} {\mathscr{F}_{L,\hat{n}}})^\top\) in this case.

Using the gradient descent update rule, we get

\[ {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_{k+1}) \approx {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_{k}) + \big(\nabla_{\boldsymbol{\Theta}} {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_{k})\big) \Big(- \eta \nabla_{\boldsymbol{\Theta}} \mathcal{C}(\boldsymbol{\Theta}_{k})\Big), \]

which can be rewritten as

\[ {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_{k+1}) - {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_{k}) \approx - \eta \big(\nabla_{\boldsymbol{\Theta}} {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_{k})\big) \Big(\nabla_{\boldsymbol{\Theta}} \mathcal{C}(\boldsymbol{\Theta}_{k})\Big). \]

Consider a supervised dataset \(\{(\boldsymbol{x}_i, \boldsymbol{y}_i)\}_{i=1}^N\) and the quadratic cost function

\[ \mathcal{C}(\boldsymbol{\Theta}) = \frac{1}{2} \sum_{i=1}^N \left({\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}_i; \boldsymbol{\Theta}) - \boldsymbol{y}_i\right)^2. \]

The gradient of the cost function in the parameter space is given by

\[ \nabla_{\boldsymbol{\Theta}} \mathcal{C}(\boldsymbol{\Theta}) = \sum_{i=1}^N \nabla_{\boldsymbol{\Theta}} {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}_i; \boldsymbol{\Theta})^\top\, \big({\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}_i; \boldsymbol{\Theta}) - \boldsymbol{y}_i\big). \]

Substituting this expression in the above expression, we get

\[ {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_{k+1}) - {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_{k}) \approx - \eta \sum_{i=1}^N \big(\nabla_{\boldsymbol{\Theta}} {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_{k})\big) \big(\nabla_{\boldsymbol{\Theta}} {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}_i; \boldsymbol{\Theta}_{k})\big)^\top \big({\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}_i; \boldsymbol{\Theta}_{k}) - \boldsymbol{y}_i\big), \]

Define

\[ \text{𝕜}_{\hat{n}}(\boldsymbol{x}_1,\boldsymbol{x}_2;\boldsymbol{\Theta}) = \big(\nabla_{\boldsymbol{\Theta}} {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}_1; \boldsymbol{\Theta})\big) \big(\nabla_{\boldsymbol{\Theta}} {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}_2; \boldsymbol{\Theta})\big)^\top, \]

which is called the Neural Tangent Kernel (NTK).

Note:

Observe that \(\text{𝕜}_{\hat{n}}(\boldsymbol{x}_1,\boldsymbol{x}_2;\boldsymbol{\Theta})\) is a matrix if \(n_L>1\) and it is a scalar if \(n_L=1\).

The change in the value of the neural network function at a point \(\boldsymbol{x}\) due to the update in the parameters is given (up to first order) in terms of NTK as

\[ {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_{k+1}) - {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}_{k}) \approx - \eta \sum_{i=1}^N \text{𝕜}_{\hat{n}}(\boldsymbol{x},\boldsymbol{x}_i;\boldsymbol{\Theta})\big({\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}_i; \boldsymbol{\Theta}_{k}) - \boldsymbol{y}_i\big). \]

By taking \(t_{k}=k\eta\) as a discrete time, we can see that the above rule is the explicit Euler discretization of the system of ordinary differential equations

\[ \frac{d}{dt} {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x};\boldsymbol{\Theta}(t)) = - \sum_{i=1}^N \text{𝕜}_{\hat{n}}(\boldsymbol{x},\boldsymbol{x}_i;\boldsymbol{\Theta}(t))\big({\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}_i; \boldsymbol{\Theta}(t)) - \boldsymbol{y}_i\big). \]

Since NTK depends on the gradient of \({\mathscr{F}_{L,\hat{n}}}\), it follows that \(\text{𝕜}_{\hat{n}}\) depends implicitly on \({\mathscr{F}_{L,\hat{n}}}\) through its parameter \(\boldsymbol{\Theta}(t)\). Hence, the RHS of the above system is nonlinear in \({\mathscr{F}_{L,\hat{n}}}\), which means that the above system is a nonlinear system of ODEs.

By increasing the hidden layer width \(\hat{n}\), we can show (proof omitted for this course) that for a sufficiently wide network with the standard NTK parameterisation (weights scaled by \(1/\sqrt{\hat n}\)) and random initialization, we have

\[ \nabla_{\boldsymbol{\Theta}} {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}(t)) \approx \nabla_{\boldsymbol{\Theta}} {\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}; \boldsymbol{\Theta}(0)). \]

where the approximation improves as \(\hat n\to\infty\) (in probability, and uniformly for \(t\) in any fixed interval \([0,T]\)), under mild regularity assumptions on the activation. Consequently the empirical NTK converges to a deterministic, time-invariant kernel \(\text{𝕜}_\infty\) and becomes independent of \(t\). That is,

\[ \text{𝕜}_{\hat n}(\boldsymbol{x},\boldsymbol{x}';\boldsymbol{\Theta}(t)) \xrightarrow[\hat n\to\infty]{} \text{𝕜}_\infty(\boldsymbol{x},\boldsymbol{x}';\boldsymbol{\Theta}(0)). \]

Therefore, in the infinite-width limit the network function evolves (approximately) according to the linear system

\[ \frac{d}{dt} {\mathscr{F}_{L,\infty}}(\boldsymbol{x};\boldsymbol{\Theta}(t)) = - \sum_{i=1}^N \text{𝕜}_{\infty}(\boldsymbol{x},\boldsymbol{x}_i)\big({\mathscr{F}_{L,\infty}}(\boldsymbol{x}_i; \boldsymbol{\Theta}(t)) - \boldsymbol{y}_i\big), \]

making the training dynamics linear in function space.

Considering the regularized cost function with the \(\ell^2\)-regularization in the function space (rather than in the parameter space) given by

\[ \mathcal{C}(\boldsymbol{\Theta}) = \frac{1}{2} \sum_{i=1}^N \left({\mathscr{F}_{L,\hat{n}}}(\boldsymbol{x}_i; \boldsymbol{\Theta}) - \boldsymbol{y}_i\right)^2 + \frac{\alpha}{2}\|{\mathscr{F}_{L,\hat{n}}}(\cdot; \boldsymbol{\Theta})\|^2, \]

where \(\alpha>0\) is the penalization parameter or ridge coefficient.

Then, the above system takes the form

Assume that the sequence \(\{\boldsymbol{\Theta}_k\}\) generated by the gradient descent update rule converges to \(\boldsymbol{\Theta}^*\) as \(k\rightarrow \infty\). Consequently, as \(t\rightarrow \infty\), we obtain the steady state \(\dfrac{d{\mathscr{F}_{L,\infty}}}{dt} = 0 \), which implies

\[ {\mathscr{F}_{L,\infty}}(\boldsymbol{x}; \boldsymbol{\Theta}^*) = \frac{1}{\alpha}\sum_{i=1}^N \text{𝕜}_{\infty}(\boldsymbol{x},\boldsymbol{x}_i)\big( \boldsymbol{y}_i -{\mathscr{F}_{L,\infty}}( \boldsymbol{x}_i; \boldsymbol{\Theta}^*) \big). \]

Taking

\[ \boldsymbol{\beta}_i = \frac{\big( \boldsymbol{y}_i -{\mathscr{F}_{L,\infty}}( \boldsymbol{x}_i; \boldsymbol{\Theta}^*) \big)}{\alpha}\in \mathbb{R}^{n_L}, ~i=1,2,\ldots, N, \]

we can write

\[ {\mathscr{F}_{L,\infty}}(\boldsymbol{x}; \boldsymbol{\Theta}^*) = \sum_{i=1}^N \text{𝕜}_{\infty}(\boldsymbol{x},\boldsymbol{x}_i)\boldsymbol{\beta}_i. \]

Let us use the notation

\[ \boldsymbol{\beta}_{\mathrm{vec}} = \begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \\ \vdots \\ \boldsymbol{\beta}_N \end{bmatrix} \in\mathbb{R}^{N n_L}, \]

where each coordinate \(\boldsymbol{\beta}_i\) is a block column vector, and for the given input set \(\mathcal{X} = \{\boldsymbol{x}_i\}_{i=1}^N,\) let

\[ K_\infty(\mathcal{X},\mathcal{X}) = \big(\text{𝕜}_\infty(\boldsymbol{x}_i, \boldsymbol{x}_j)\big)_{i,j=1}^N \]

be the \(N\times N\) symmetric block Gram matrix whose \((i,j)\)-th block is the Gram matrix \(\text{𝕜}_\infty(\boldsymbol{x}_i, \boldsymbol{x}_j)\). With these notations, we can write

\[ {\mathscr{F}_{L,\infty}}(\boldsymbol{x}; \boldsymbol{\Theta}^*) = K_\infty(\boldsymbol{x},\mathcal{X}) \boldsymbol{\beta}_{\mathrm{vec}}, \]

where

\[ K_\infty(\boldsymbol{x},\mathcal{X}) =\big( \text{𝕜}_\infty(\mathbf{x},\mathbf{x}_1), \text{𝕜}_\infty(\mathbf{x},\mathbf{x}_2), \cdots, \text{𝕜}_\infty(\mathbf{x},\mathbf{x}_N) \big)\in \mathbb{R}^{n_L\times N n_L} \]

is the block row vector with each component being the \(n_L\times n_L\) matrix.

The above expression shows that \({\mathscr{F}_{L,\infty}}\) lies in the span of the kernel functions \(\{ \text{𝕜}_\infty(\cdot, \mathbf{x}_i) \}_{i=1}^N\), i.e., \({\mathscr{F}_{L,\infty}}\) resembles a kernel machine.

Physics Informed Neural Networks for ODEs

Feedforward neural networks can be used to learn solutions of initial and/or boundary value problems for differential equations (ODEs and PDEs). In this section, we restrict our discussions to initial value problem for first-order ordinary differential equations (ODEs) of the form

\begin{eqnarray} \begin{aligned} \frac{dy}{dt} &= f\big(t,y\big), \qquad t\in[0,T],\\ y(0) &= y_0, \end{aligned} \end{eqnarray}

(6.10)

where \(y=y(t)\) is the solution for a given (sufficiently regular) function \(f:[0,T]\times\mathbb{R}\to\mathbb{R}\), for some \(T>0\), and \(y_0\in\mathbb{R}\) is the prescribed initial value. We seek an approximation \(\widehat{y}(t)\approx y(t)\) for \(t\in[0,T].\)

We briefly discuss how a feedforward neural network (FNN) can be used to obtain an approximation \(\widehat{y}(\cdot)={\mathscr{F}_{L,\hat{n}}}(\cdot;\boldsymbol{\Theta})\).

The mapping \(t\mapsto {\mathscr{F}_{L,\hat{n}}}(t;\boldsymbol{\Theta})\) is chosen smooth enough so that time derivatives of the network output are defined via automatic differentiation.

Trial Solution

There are two ways that we can define the trial solution.

Hard-constrained Formulation:

To enforce the initial condition exactly, define the trial solution

\begin{eqnarray} \widetilde{y}(t;\boldsymbol{\Theta}) := y_0 + t\,{\mathscr{F}_{L,\hat{n}}}(t;\boldsymbol{\Theta}). \end{eqnarray}

(6.11)

By construction \(\widetilde{y}(0;\boldsymbol{\Theta})=y_0\) for every choice of parameters \(\boldsymbol{\Theta}\). Therefore, the neural network function \({\mathscr{F}_{L,\hat{n}}}(t;\boldsymbol{\Theta})\) is the corrective term scaled by \(t\), and \(\widetilde{y}\) is the required approximate solution of the given IVP which satisfies the initial condition exactly.

Soft-constrained Formulation:

Here, the trial solution itself is represented by the neural network function

\begin{eqnarray} \widetilde{y}(t;\boldsymbol{\Theta}) :={\mathscr{F}_{L,\hat{n}}}(t;\boldsymbol{\Theta}) \end{eqnarray}

(6.12)

and the initial condition is not enforced exactly rather imposed through a penalty term in the cost function. Thus, in this case, the neural network function is the required approximate solution of the given IVP which satisfies the initial condition approximately.

Residual and Training Loss

Define the pointwise residual loss function of the ODE for the trial solution as

\begin{eqnarray} \mathcal{R}(t;\boldsymbol{\Theta}) := \frac{d}{dt}\widetilde{y}(t;\boldsymbol{\Theta}) - f\big(t,\widetilde{y}(t;\boldsymbol{\Theta})\big). \end{eqnarray}

(6.13)

Select a set of collocation points \(\{t_j\}_{j=0}^N \subset [0,T]\) with \(t_0=0\) and \(\Delta t_j = t_j-t_{j-1}\), \(j=1,2,\ldots, N\). The mean squared residual cost function is defined as

\begin{eqnarray} \mathcal{C}_{\texttt{res}}(\boldsymbol{\Theta}) \;=\; \sum_{i=1}^N \big|\mathcal{R}(t_i;\boldsymbol{\Theta})\big|^2\Delta t_i. \end{eqnarray}

(6.14)

The collocation points may be chosen uniformly, randomly, or using adaptive strategies. The residual cost function ensures that the neural network satisfies the ODE (6.10) approximately. We also need to enforce the initial condition \(y(0)=y_0\), which can be done by adding a penalty corresponding to the initial condition as

\begin{eqnarray} \mathcal{L}_{\texttt{ic}}(\boldsymbol{\Theta}) = \left|\widetilde{y}(0;\boldsymbol{\Theta}) - y_0 \right|^2. \end{eqnarray}

(6.15)

Finally, the total cost function is given by

\begin{eqnarray} \mathcal{C}(\boldsymbol{\Theta}) = \mathcal{C}_{\texttt{res}}(\boldsymbol{\Theta}) + \lambda\, \mathcal{L}_{\texttt{ic}}(\boldsymbol{\Theta}), \end{eqnarray}

(6.16)

where \(\lambda > 0\) is a weighting parameter controlling the relative importance of satisfying the initial condition.

Remark:

Differentiating \(\widetilde{y}(t;\boldsymbol{\Theta})\) in the residual loss function is equivalent to differentiating the neural network function \({\mathscr{F}_{L,\hat{n}}}(t;\boldsymbol{\Theta})\), which we need to compute as a part of the optimization method. There are two ways that we can compute \(\frac{d}{dt}{\mathscr{F}_{L,\hat{n}}}(t;\boldsymbol{\Theta})\).

We can perform exact computation of \(\frac{d}{dt}{\mathscr{F}_{L,\hat{n}}}(t;\boldsymbol{\Theta})\) using automatic differentiation (not covered in this course) of the FNN.
We can also approximate \(\frac{d}{dt}\widetilde{y}(t;\boldsymbol{\Theta})\) at the collocation points using backward difference operator given by
\[ \frac{d}{dt}\widetilde{y}(t;\boldsymbol{\Theta}) \approx \frac{\widetilde{y}(t_{k};\boldsymbol{\Theta}) - \widetilde{y}(t_{k-1};\boldsymbol{\Theta})}{\Delta t_i}. \]

Note:

The ANN-based ODE solver discussed here is a special case of the well-known Physics-Informed Neural Networks (PINN). The essence of PINNs is to incorporate the underlying physics governed by the differential equation through the residual cost function, along with the associated initial and/or boundary conditions.

Extending the ideas of PINNs from ODEs to PDEs is not a straight forward task. However, in certain simple PDE problems, the PINN formulation is simple.

Problem:

Obtain the total cost function for the soft-constrained formulation of the following initial and value problem:

\[ \begin{array}{cc} \dfrac{\partial u}{\partial t} + \dfrac{\partial u}{\partial x} = 0,&(x,t)\in [a,b]\times [0,T],\\ u(x,0) = u_0(x),&x\in [a,b],\\ u(a,t) = u_0(a), & t\in[0,T]. \end{array} \]

Hint:

First write the trial function \(\tilde{u}(x,t;\boldsymbol{\theta})\).

Write the PDE residual loss function.

Write the loss functions for initial and boundary conditions, separately.

Introduce the two dimensional collocation points for \([a,b]\times [0,T]\).

Write the PDE residual cost function on the collocation points (with double sum).

Write the initial and boundary cost functions on the collocation points (separately, in space and time variables).

Finally, write the total cost function.

Training Algorithm

The training goal is to find a parameter set \(\boldsymbol{\Theta}^\star\) such that

\[ \boldsymbol{\Theta}^\star = \text{argmin}_{\boldsymbol{\Theta}} \, \mathcal{C}(\boldsymbol{\Theta}). \]

As an exact solution cannot be obtained for the above unconstrained optimization problem, we may use any of the standard gradient-based optimizers (like mini-batch GD, SGD with momentum or Adam) to obtain an approximation to \(\boldsymbol{\Theta}^\star\).

Algorithm:

Input:

Choose FNN architecture for \({\mathscr{F}_{L,\hat{n}}}\) (number of layers, neurons, activation functions).

Select collocation points \(\{t_j\}_{j=0}^N\) in \([0,T]\).

Choose an initial parameter set \(\boldsymbol{\Theta}_0\).

Processing:

For \(k=1,2,\ldots\), do the following steps:

Form the trial solution \(\widetilde{y}(t_k;\boldsymbol{\Theta}_k)=y_0+t_k\,{\mathscr{F}_{L,\hat{n}}}(t_k;\boldsymbol{\Theta}_k)\).
Compute residuals \(R(t_j;\boldsymbol{\Theta}_k)\), \(j=1,2,\ldots, N\), at collocation points using (6.13).
Compute the residual cost \(\mathcal{L}(\boldsymbol{\Theta}_k)\) using (6.14).
Compute the penalty term \(\mathcal{L}_{\texttt{ic}}(0;\boldsymbol{\Theta}_k)\) using (6.15)
Update \(\boldsymbol{\Theta}_k\) by gradient-based optimisation.
Check for the regularization condition. If satisfied, break the process. If not, then increate \(k\) by one and repeat the above steps.

Output: Take \(\boldsymbol{\Theta}^\star = \boldsymbol{\Theta}_k\), which the parameter set returned by the optimization method. Return \(\widetilde{y}(t;\boldsymbol{\Theta}^\star)\) as the approximate solution.

Code:

The following code illustrate the training of a PINN:

PINNLinearAdvection

The code is developed using PyTorch. Go through the code carefully and understand it.

The same can be developed using TensorFlow.

The output of the code is shown below, where the PINN solution is compared with the exact solution of the given initial and boundary value problem for the linear advection equation.

Research Perspectives

Developing suitable architectures and efficient learning procedures for PINNs and their variants is an active research area. They form an important part of a growing branch of Machine Learning, known as Scientific Machine Learning (SciML). The importance of SciML lies in its wide range of applications, such as

fluid mechanics (incompressible and compressible Navier–Stokes equations, and turbulence modeling),

solid mechanics and material modeling,

inverse problems (parameter identification),

computational finance (stochastic differential equations, option pricing, and portfolio optimization formulated via stochastic control or Hamilton–Jacobi–Bellman equations),

Quantum mechanics (Schrödinger-type equations),

and so on.

The development and analysis of PINNs and their variants broadly goes through the following steps:

identifying the appropriate functional setting (solution space) for the giving problem (initial and/or boundary value problem),

designing a suitable architecture to learn an approximate solution of the problem,

analyzing the expressivity of the network class (similar to universal approximation results) in the solution space,

devising efficient and stable learning methodologies, and

performing numerical experiments to demonstrate the performance of the learned model, and

performing error analyses (theoretically and numerically) to show how well the trained neural network approximates a solution of the problem.

At present, there is active research towards using the Neural Tangent Kernel (NTK) framework to analyze, and in some cases improve, the training dynamics and generalization properties of various types of neural networks, including transformers (to a certain extent), and in particular, of PINNs.

The NTK perspective is mainly used in analyzing the convergence behavior, spectral bias, and optimization dynamics of neural networks. Thus, NTK may be used to bridge the gap between theoretical analysis and practical implementation.