Special Architectures

So far, we have studied multi-layer perceptrons (MLPs), which are fully connected (dense) feedforward neural networks. We have seen that these networks perform well on a variety of pattern recognition problems that can be viewed as function approximation tasks, consistent with their property as universal approximators. We also extended our study to some recent topics, exploring their training dynamics through the Neural Tangent Kernel (NTK) and their role in scientific computing through Physics-Informed Neural Networks (PINNs).

In this chapter, we focus on some special architectures that are designed for specific types of data or applications, such as convolutional neural networks (CNNs), which are particularly suitable for image recognition, and recurrent neural networks (RNNs), which are suitable for language and sequence modelling. We conclude the chapter with a discussion of a more recent development, the Transformer, an attention-based architecture that has revolutionized sequence modelling. Transformers provide a unified framework for learning from text, images, and beyond.

Convolution Neural Network

The dense structure of MLPs treats each input component independently and ignores the spatial structure of the input data. For example, in image data, nearby pixels are highly correlated, and certain local features such as edges or corners can appear at different locations in the image. A dense layer also requires a very large number of parameters, so it scales poorly to high-resolution images. Convolutional neural networks (CNNs) overcome these limitations by incorporating the local spatial structure of the data through local connectivity and weight sharing.

In local connectivity, each neuron is connected only to a small spatial region (patch) of the input in order to capture local patterns. Through weight sharing, the same set of weights, referred to as a filter or kernel, is used across different spatial locations. This makes the layer respond to a feature in the same way wherever it appears in the input (translation equivariance) and greatly reduces the number of parameters. Pooling layers are often added to reduce spatial dimensions and to retain the most important features. In this section, we briefly outline CNNs.

Networks of One-Dimensional Signals

In many applications, such as time-series analysis, speech recognition, or any sequence-based data, the input is naturally one-dimensional. For such data, one-dimensional convolutional neural networks (1D CNNs) provide an efficient architecture that captures local dependencies along the temporal or sequential axis.

Let \(f, g : \mathbb{R} \to \mathbb{R}\). The continuous one-dimensional convolution of \(f\) and \(g\), denoted by \((f * g)\), is defined as

\begin{eqnarray} (f * g)(x) = \int_{\mathbb{R}} f(\xi)\, g(x - \xi)\, d\xi, \end{eqnarray}
(7.1)

whenever the integral exists.

Analogously, in the discrete case where the signals are sequences, say, \(f,g:\mathbb{Z}\rightarrow \mathbb{R}\) given by

\[ f = \{f_i\}_{i \in \mathbb{Z}}, \qquad g = \{g_i\}_{i \in \mathbb{Z}}, \]

the discrete convolution is defined by

\[ (f * g)_i = \sum_{j \in \mathbb{Z}} f_j \, g_{i - j} . \]

Remark:
[Finitely Supported Signals]

In practice, we have signals of finite length. In this case, we extend the signal by zero and refer to the result as a signal with finite support. For instance, if we have

\[ \boldsymbol{g}=[g_0, g_1, \ldots, g_{N-1}], \]

for some \(N>0\), then it should be understood as extended by zero on either side, giving the signal of infinite length \(g=[\ldots, 0, \boldsymbol{g}, 0, \ldots]\).

In CNNs, it is customary to use the cross-correlation operation (instead of strict convolution), given by

\[ (\boldsymbol{w} * \boldsymbol{g})_i = \sum_{j = 0}^{r-1} w_j\, g_{i + j}, ~~ i = 0, 1, \ldots, N - r, \]

where \(\boldsymbol{w} = [w_0, w_1, \ldots, w_{r-1}]\) is the kernel or filter of size \(r\) and \(\boldsymbol{g}=[g_0, g_1, \ldots, g_{N-1}]\) is a finite input signal with \(0 < r \le N\).
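The two operations can be compared numerically. The following is a minimal NumPy sketch; the function name `cross_correlate_1d` and the sample values are illustrative choices and do not come from the text. It evaluates the cross-correlation sum directly and checks it against NumPy's built-in routines, where flipping the kernel turns the convolution into the cross-correlation.

```python
import numpy as np

def cross_correlate_1d(w, g):
    """Valid cross-correlation: out[i] = sum_j w[j] * g[i + j], i = 0..N-r."""
    w = np.asarray(w, dtype=float)
    g = np.asarray(g, dtype=float)
    r, N = len(w), len(g)
    return np.array([np.dot(w, g[i:i + r]) for i in range(N - r + 1)])

g = np.array([1.0, 2.0, 4.0, 7.0, 11.0])   # illustrative signal, N = 5
w = np.array([1.0, 0.0, -1.0])             # illustrative kernel, r = 3

out = cross_correlate_1d(w, g)
# NumPy's correlate computes the same sum (signal first, kernel second).
out_np = np.correlate(g, w, mode="valid")
# Strict convolution flips the kernel, so convolving with the reversed
# kernel reproduces the cross-correlation.
out_conv = np.convolve(g, w[::-1], mode="valid")
print(out, out_np, out_conv)
```

Since a CNN learns its kernel weights, it makes no practical difference whether the kernel is flipped or not, which is why the simpler cross-correlation is used.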

Example:
[Moving Average as a Convolution Operation]

Let \( \boldsymbol{x} = [x_1,\, x_2,\, \ldots,\, x_N] \) be a discrete-time signal (or sequence). A moving average with window size \(r\) computes each output value as the average of \(r\) consecutive inputs, given by

\[ y_i = \frac{1}{r}\sum_{j=0}^{r-1} x_{i+j}, \quad i = 1, 2, \ldots, N - r + 1. \]

This can be viewed as a cross-correlation (or convolution without kernel flipping) between the input \(\boldsymbol{x}\) and the kernel

\begin{eqnarray} \boldsymbol{w} = \frac{1}{r}[1,\, 1,\, \ldots,\, 1], \end{eqnarray}
(7.2)

that is,

\begin{eqnarray} y_i = \sum_{j=0}^{r-1} w_j\, x_{i+j}. \end{eqnarray}
(7.3)

Thus, the moving average is a special case of convolution where all kernel weights are equal and sum to one.

The moving average smooths the signal by reducing local fluctuations and it acts as a low-pass filter, retaining slow variations in \(\boldsymbol{x}\) and attenuating rapid changes.
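As a small numerical illustration, the moving average can be computed as a cross-correlation with the constant kernel. This is a sketch under our own assumptions: the helper name `moving_average` and the sample signal are illustrative.

```python
import numpy as np

def moving_average(x, r):
    """Moving average of window r via cross-correlation with w = (1/r)[1,...,1]."""
    w = np.full(r, 1.0 / r)
    return np.correlate(np.asarray(x, dtype=float), w, mode="valid")

x = np.array([1.0, 2.0, 3.0, 10.0, 3.0, 2.0, 1.0])  # illustrative signal with a spike
y = moving_average(x, r=3)
print(y)  # length N - r + 1 = 5; the spike is spread out and reduced in height
```

The isolated spike in the input is a rapid variation, and the output shows it attenuated, in line with the low-pass behaviour described above.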

In two dimensions, the same smoothing idea extends naturally to image processing, where a smoothing kernel corresponds to blurring an image.

In contrast, edge detection in images corresponds to emphasizing changes (differences) in the signal, i.e., high-frequency components. In one dimension, this can be seen directly through convolution with a difference kernel or high-pass filter.

Example:
[One-Dimensional Edge Detection]

Consider a one-dimensional input signal \( \boldsymbol{x} = [x_1,\, x_2,\, \ldots,\, x_N] \) and a difference kernel (or edge detection kernel)

\[ \boldsymbol{w} = [-1, 1] \]

The convolution (or, in CNNs, the cross-correlation) of \(\boldsymbol{x}\) and \(\boldsymbol{w}\) produces the output

\[ y_i = (-1)\,x_i + (1)\,x_{i+1} = x_{i+1} - x_i, \qquad i = 1, 2, \ldots, N-1. \]

Hence,

\[ \boldsymbol{y} = [x_2 - x_1,\, x_3 - x_2,\, \ldots,\, x_N - x_{N-1}]. \]

This output measures the change (or discrete gradient) between successive input values. Regions where \(\boldsymbol{x}\) is approximately constant produce small responses, while sharp transitions yield large responses. Large responses therefore correspond to edges in one-dimensional data.

In two dimensions, similar difference kernels (e.g., Sobel or Prewitt filters) are used to detect edges in images.
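The one-dimensional edge detector can be tried on a step signal. The following is a minimal sketch; the step signal and the threshold `0.5` are illustrative assumptions.

```python
import numpy as np

x = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # illustrative step signal
w = np.array([-1.0, 1.0])                      # difference kernel from the example

y = np.correlate(x, w, mode="valid")           # y[i] = x[i+1] - x[i]
print(y)                                       # nonzero only at the jump

# Positions where the response exceeds an (arbitrary) threshold mark the edge.
edge_positions = np.flatnonzero(np.abs(y) > 0.5)
print(edge_positions)
```

The response vanishes on the two constant parts of the signal and is nonzero only where the value jumps.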

One-Dimensional Convolutional Layer

Let the input to a 1D convolutional layer be a sequence (or signal)

\[ \boldsymbol{x} = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^N. \]

and let the convolutional kernel (or filter) be

\[ \boldsymbol{w} = [w_1, w_2, \ldots, w_r] \in \mathbb{R}^r, \]

where the integer \(1\le r\le N\) is called the kernel size or receptive field length.

Definition:
[Convolutional Layer]

For a given integer \(s \ge 1\), a convolutional layer with activation function \(\mathscr{A}\), convolution kernel \(\boldsymbol{w}\) and bias \(b\) produces the output

\begin{eqnarray} a_i = \mathscr{A}\!\Bigg(\sum_{j=1}^{r} w_j\, x_{(i-1)s + j} - b\Bigg), \quad i = 1, 2, \ldots, n_{\text{out}}, \end{eqnarray}
(7.4)

where

\begin{eqnarray} n_{\text{out}} = \Big\lfloor \frac{N - r}{s} \Big\rfloor + 1. \end{eqnarray}
(7.5)

The number \(s\) is called the stride, which specifies the step size with which the kernel moves along the input.

Note:
The bias term is written with a negative sign for consistency with our earlier definition of affine neurons. This convention is immaterial in practice, since \(b\) is a learnable parameter and its sign can be absorbed into its value.

The convolution layer replaces the affine transformation \(\boldsymbol{w}\cdot\boldsymbol{x} - b\) of a neuron in an MLP by the local convolution operation \(\boldsymbol{w} * \boldsymbol{x} - b\). Unlike an MLP layer, a convolution layer is not organized neuron-wise, since each output activation (feature map entry) depends only on a local receptive field of the input rather than the entire input vector. Nevertheless, the convolution operation can be equivalently expressed as a matrix-vector multiplication, where the kernel \(\boldsymbol{w}\) is expanded into a structured (sparse Toeplitz) matrix. Thus, a convolutional layer can be viewed as a special kind of fully connected layer.
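The matrix viewpoint can be checked numerically. The sketch below builds the structured matrix explicitly; the helper name `conv_matrix` and the sample values are illustrative assumptions, and the bias and activation are left out so that only the linear part is compared.

```python
import numpy as np

def conv_matrix(w, N, s=1):
    """Dense (n_out x N) matrix T with T[i, i*s + j] = w[j], zeros elsewhere."""
    w = np.asarray(w, dtype=float)
    r = len(w)
    n_out = (N - r) // s + 1
    T = np.zeros((n_out, N))
    for i in range(n_out):
        T[i, i * s : i * s + r] = w
    return T

w = np.array([2.0, -1.0, 0.5])            # illustrative kernel
x = np.arange(1.0, 7.0)                   # x = [1, 2, 3, 4, 5, 6]
T = conv_matrix(w, N=len(x))
print(T)                                   # each row is the previous one shifted right
# The matrix-vector product reproduces the sliding-window sums.
assert np.allclose(T @ x, np.correlate(x, w, mode="valid"))
```

The matrix has only \(r\) distinct nonzero values, repeated along diagonals, whereas a general dense layer of the same shape would have \(n_{\text{out}} \times N\) independent weights.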

Problem:
[Convolution Layer as a Fully Connected Layer]

Show that a one-dimensional convolutional layer can be expressed as a fully connected (dense) layer with a specially structured (sparse Toeplitz) weight matrix.

Example:
[Convolutional Layer with Stride \(s=1\)]

Consider an input sequence

\[ \boldsymbol{x} = [x_1,\, x_2,\, x_3,\, x_4,\, x_5,\, x_6], \]

and a kernel (filter)

\[ \boldsymbol{w} = [w_1,\, w_2], \]

with kernel size \(r = 2\), stride \(s = 1\), and bias \(b\).

According to the convolutional layer definition,

\begin{eqnarray} a_i = \mathscr{A}\!\Bigg(\sum_{j=1}^{r} w_j\, x_{(i-1)s + j} - b\Bigg), \quad i = 1, 2, \ldots, n_{\text{out}}, \end{eqnarray}
(7.6)

where

\begin{eqnarray} n_{\text{out}} = \Big\lfloor \frac{N - r}{s} \Big\rfloor + 1 = \Big\lfloor \frac{6 - 2}{1} \Big\rfloor + 1 = 5. \end{eqnarray}
(7.7)

Thus, the layer output activations are

\begin{eqnarray} \begin{aligned} a_1 &= \mathscr{A}(w_1 x_1 + w_2 x_2 - b),\\ a_2 &= \mathscr{A}(w_1 x_2 + w_2 x_3 - b),\\ a_3 &= \mathscr{A}(w_1 x_3 + w_2 x_4 - b),\\ a_4 &= \mathscr{A}(w_1 x_4 + w_2 x_5 - b),\\ a_5 &= \mathscr{A}(w_1 x_5 + w_2 x_6 - b). \end{aligned} \end{eqnarray}
(7.8)

Hence, the output feature map (activation vector) of the convolution layer is

\begin{eqnarray} \boldsymbol{a} = [a_1,\, a_2,\, a_3,\, a_4,\, a_5]. \end{eqnarray}
(7.9)
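For concrete numbers, the layer can be sketched in a few lines of NumPy. The function name `conv1d_layer`, the ReLU activation, and the numerical values below are illustrative assumptions; the code uses 0-based indices, so window \(i\) starts at position \(i\,s\).

```python
import numpy as np

def conv1d_layer(x, w, b, s=1, activation=np.tanh):
    """a_i = A(sum_j w_j x_{(i-1)s+j} - b) for i = 1..floor((N-r)/s)+1."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    N, r = len(x), len(w)
    n_out = (N - r) // s + 1
    pre = np.array([np.dot(w, x[i * s : i * s + r]) - b for i in range(n_out)])
    return activation(pre)

def relu(z):
    return np.maximum(z, 0.0)

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])   # N = 6
w = np.array([1.0, -1.0])                      # r = 2
a = conv1d_layer(x, w, b=0.5, s=1, activation=relu)
print(a)                                       # five activations, as in the example

a_s3 = conv1d_layer(x, w, b=0.5, s=3, activation=relu)
print(a_s3)                                    # floor((6 - 2) / 3) + 1 = 2 activations
```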

Problem:
Consider an input vector \(\boldsymbol{x} = [x_1, x_2, x_3, x_4, x_5, x_6]\), a kernel (filter) \(\boldsymbol{w} = [w_1, w_2]\), and a bias term \(b\). Using the convolutional layer with stride \(s = 2\) and activation function \(\mathscr{A}\), compute the output activations.

Remark:
[Padding]

The number of zeros added to both ends of the input sequence before performing convolution is called padding. Padding ensures that edge elements receive equal treatment and can preserve the original input length. If we pad with \(p\) zeros on each side, the effective input length becomes \(N + 2p\).

Given an input of length \(N\), a kernel of size \(r\), a stride \(s\), and padding \(p\), the convolutional layer is given by

\begin{eqnarray} a_i = \mathscr{A}\!\Bigg(\sum_{j=1}^{r} w_j\, x_{(i-1)s + j - p} - b\Bigg), \quad i = 1, 2, \ldots, n_{\text{out}}, \end{eqnarray}
(7.10)

where \(x_k = 0\) whenever \(k < 1\) or \(k > N\) (the padded zeros), and

\begin{eqnarray} n_{\text{out}} = \Bigg\lfloor \frac{N + 2p - r}{s} \Bigg\rfloor + 1. \end{eqnarray}
(7.11)
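The effect of padding on the output length can be seen in a short sketch. The function name `conv1d_padded`, the identity activation, and the sample values are illustrative assumptions; `np.pad` with its default constant mode supplies the zeros.

```python
import numpy as np

def conv1d_padded(x, w, b=0.0, s=1, p=0, activation=lambda z: z):
    """Zero-pad x with p entries per side, then apply the strided layer."""
    x_pad = np.pad(np.asarray(x, dtype=float), p)      # length N + 2p
    w = np.asarray(w, dtype=float)
    r = len(w)
    n_out = (len(x_pad) - r) // s + 1
    pre = np.array([np.dot(w, x_pad[i * s : i * s + r]) - b for i in range(n_out)])
    return activation(pre)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # N = 5
w = np.array([1.0, 1.0, 1.0])              # r = 3
no_pad = conv1d_padded(x, w, p=0)          # length floor((5 - 3) / 1) + 1 = 3
same = conv1d_padded(x, w, p=1)            # length floor((5 + 2 - 3) / 1) + 1 = 5
print(no_pad, same)
```

With \(r = 3\), \(s = 1\) and \(p = 1\), the output has the same length as the input, and the first and last inputs now contribute to as many windows as their neighbours allow.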

Problem:
Consider an input vector \(\boldsymbol{x} = [x_1, x_2, x_3, x_4, x_5, x_6]\), a kernel \(\boldsymbol{w} = [w_1, w_2]\), and a bias term \(b\). Using the convolutional layer with stride \(s = 1\), activation function \(\mathscr{A}\), and padding of one zero on each side (i.e., \(p=1\)), compute the output activations.

Remark:
[Pooling Layer]

After a convolutional layer produces its output (called the feature map), it is often followed by a pooling layer. Given an input feature map

\begin{eqnarray} \boldsymbol{z} = [z_1, z_2, \ldots, z_N], \end{eqnarray}
(7.12)

a max-pooling layer with pooling window of size \(p\) and stride \(s_p\) produces the output

\begin{eqnarray} z_i^{(\text{pool})} = \max_{j \in \mathcal{N}(i)} z_j, \quad i = 1, 2, \ldots, n_{\text{pool}}, \end{eqnarray}
(7.13)

where \(\mathcal{N}(i)\) denotes a small neighborhood of indices around \(i\), typically

\begin{eqnarray} \mathcal{N}(i) = \{(i-1)s_p + 1, (i-1)s_p + 2, \ldots, (i-1)s_p + p\}, \end{eqnarray}
(7.14)

and

\begin{eqnarray} n_{\text{pool}} = \Big\lfloor \frac{N - p}{s_p} \Big\rfloor + 1. \end{eqnarray}
(7.15)

If the maximum operation is replaced by the average, we obtain the average pooling layer. Similarly, if the maximum operation is replaced by the minimum operation, then we obtain the min-pooling layer.

Here, \(\boldsymbol{z}\) denotes the output feature map obtained from the preceding convolutional layer. The pooling operation acts on \(\boldsymbol{z}\) to produce a lower-dimensional representation \(\boldsymbol{z}^{(\text{pool})}\), which can be fed to the next convolutional or fully connected layer. Pooling thus serves two main purposes:

  • It reduces the spatial resolution (or length) of the feature map, thereby decreasing the computational cost and the number of learnable parameters in subsequent layers.
  • It increases translational invariance, meaning that small shifts or distortions in the input do not significantly change the pooled output.
    Example:
    [1D Max Pooling]

    Let the convolution output be

    \begin{eqnarray} \boldsymbol{z} = [z_1, z_2, z_3, z_4, z_5, z_6], \end{eqnarray}
    (7.16)

    and consider max pooling with pooling size \(p=2\) and stride \(s_p=2\). Then,

    \begin{eqnarray} \begin{aligned} z^{(\text{pool})}_1 &= \max(z_1, z_2),\\ z^{(\text{pool})}_2 &= \max(z_3, z_4),\\ z^{(\text{pool})}_3 &= \max(z_5, z_6). \end{aligned} \end{eqnarray}
    (7.17)

    Hence, the pooled output is

    \begin{eqnarray} \boldsymbol{z}^{(\text{pool})} = [z^{(\text{pool})}_1,\, z^{(\text{pool})}_2,\, z^{(\text{pool})}_3], \end{eqnarray}
    (7.18)

    which has reduced length compared to the original feature map.

    Let us now illustrate translational invariance through pooling.

    Example:
    Consider the convolutional output

    \begin{eqnarray} \boldsymbol{z} = [2,\, 5,\, 3,\, 4]. \end{eqnarray}
    (7.19)

    Applying max pooling with window size \(2\) and stride \(2\) gives

    \begin{eqnarray} \boldsymbol{z}^{(\text{pool})} = [\max(2,5),\, \max(3,4)] = [5,\, 4]. \end{eqnarray}
    (7.20)

    If a small shift or distortion of the input perturbs the feature map to

    \begin{eqnarray} \boldsymbol{z}' = [1,\, 5,\, 4,\, 3], \end{eqnarray}
    (7.21)

    then the pooled output remains

    \begin{eqnarray} \boldsymbol{z}^{\prime(\text{pool})} = [\max(1,5),\, \max(4,3)] = [5,\, 4]. \end{eqnarray}
    (7.22)

    Hence, pooling increases translational invariance, meaning that small shifts in the input signal do not significantly affect the pooled output.
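The pooling computations above can be reproduced with a short sketch. The function name `pool_1d` and its `reduce` argument are illustrative assumptions; the code uses 0-based indices, so window \(i\) covers positions \(i\,s_p, \ldots, i\,s_p + p - 1\).

```python
import numpy as np

def pool_1d(z, p, s_p, reduce=np.max):
    """Apply `reduce` over the windows z[i*s_p : i*s_p + p]."""
    z = np.asarray(z, dtype=float)
    n_pool = (len(z) - p) // s_p + 1
    return np.array([reduce(z[i * s_p : i * s_p + p]) for i in range(n_pool)])

z = [2, 5, 3, 4]
z_perturbed = [1, 5, 4, 3]
pooled = pool_1d(z, p=2, s_p=2)                       # max pooling
pooled_perturbed = pool_1d(z_perturbed, p=2, s_p=2)   # same pooled output
averaged = pool_1d(z, p=2, s_p=2, reduce=np.mean)     # average pooling
print(pooled, pooled_perturbed, averaged)
```

The invariance is only approximate in general: a change that moves the largest value across a window boundary does alter the pooled output.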

    Two-Dimensional Convolutional Layer

    We now extend the one-dimensional convolutional layer to the two-dimensional case, which is the core operation in convolutional neural networks (CNNs) for image processing.

    Convolution Operator

    In this subsection, we briefly recall the basic definitions concerning convolution. We first start with convolution of two functions from \(\mathbb{R}^2\) to \(\mathbb{R}\).

    Definition:
    [Convolution of Functions]

    Let \(f, g : \mathbb{R}^2 \to \mathbb{R}\). The convolution of \(f\) and \(g\), denoted by \((f * g)\), is defined by

    \[ (f * g)(x, y) = \int_{\mathbb{R}^2} f(\xi, \eta)\, g(x - \xi,\, y - \eta)\, d\xi\, d\eta, \]

    whenever the integral exists.

    The continuous definition of the convolution operator has a discrete analogue, leading to the definition of convolution between two matrices.

    Definition:
    [Convolution of Matrices]

    Let \(A = [a_{i,j}] \in \mathbb{R}^{m \times n}\) and \(K = [k_{p,q}] \in \mathbb{R}^{r \times s}\). The discrete convolution of \(A\) and \(K\), denoted by \((K * A)\), is defined by

    \[ (K * A)_{i,j} = \sum_{p=0}^{r-1} \sum_{q=0}^{s-1} k_{p,q}\, a_{\,i - p,\, j - q}, \]

    where \(a_{i - p,\, j - q} = 0\) whenever the indices fall outside the range of \(A\).

    CNNs typically use the cross-correlation operation given by

    \[ (K * A)_{i,j} = \sum_{p=0}^{r-1} \sum_{q=0}^{s-1} k_{p,q}\, a_{\,i + p,\, j + q}. \]

    Note:
    In the machine learning literature, the cross-correlation above is itself commonly referred to as convolution, and we follow the same convention here.
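A direct implementation of the two-dimensional cross-correlation is sketched below. The function name `cross_correlate_2d` and the sample matrices are illustrative assumptions, and the sketch evaluates only those positions where the kernel lies entirely inside \(A\), rather than extending \(A\) by zeros.

```python
import numpy as np

def cross_correlate_2d(K, A):
    """Valid 2D cross-correlation: out[i, j] = sum_{p,q} K[p, q] * A[i + p, j + q]."""
    K = np.asarray(K, dtype=float)
    A = np.asarray(A, dtype=float)
    r, s = K.shape
    m, n = A.shape
    out = np.zeros((m - r + 1, n - s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(K * A[i:i + r, j:j + s])
    return out

A = np.arange(1.0, 10.0).reshape(3, 3)    # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
K = np.array([[1.0, 0.0],
              [0.0, -1.0]])
out = cross_correlate_2d(K, A)            # out[i, j] = A[i, j] - A[i+1, j+1]
# Flipping the kernel in both directions gives the weights that strict
# convolution would apply to the same window.
out_flipped = cross_correlate_2d(K[::-1, ::-1], A)
print(out)
print(out_flipped)
```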

    Remark:
    The matrix \(A\) can be viewed as an infinite matrix

    \[ A = [a_{i,j}]_{i,j=-\infty}^\infty \]

    with compact support, which means that there exist integers \(r \ge 1\) and \(c \ge 1\) such that \(a_{i,j}=0\) whenever \(i \notin \{0, 1, \ldots, r-1\}\) or \(j \notin \{0, 1, \ldots, c-1\}\). In this case, we say that \(A\) has an \(r\times c\) support.

    Example:
    [Moving average]

    Given a two-dimensional discrete signal \(A\), the \(2\times 2\) supported kernel that produces a simple moving average is given by

    \[ W = \left[\begin{array}{cc} \tfrac{1}{4}&\tfrac{1}{4}\\ \tfrac{1}{4}&\tfrac{1}{4}\\ \end{array}\right], \]

    where \(w_{0,0}=w_{0,1}=w_{1,0}=w_{1,1}=1/4\), say, and the matrix is extended by zeros in all four directions. Then the cross-correlation leads to the simple moving average

    \[ (W * A)_{i,j} = \sum_{p=0}^{1} \sum_{q=0}^{1} w_{p,q}\, a_{\,i + p,\, j + q} = \frac{1}{4}\big(a_{i,j} + a_{i+1,j} + a_{i,j+1} + a_{i+1,j+1} \big). \]

    This operation replaces each pixel value by the average of the \(2\times 2\) block of pixels whose top-left corner is at \((i,j)\), thereby producing a blurred or smoothed version of the image.

    Averaging replaces each pixel by the mean of its local neighborhood, which suppresses rapid spatial variations (high-frequency components) while preserving slowly varying content (low-frequency components). Consequently, sharp transitions (edges) and noise are attenuated, yielding a blurred (smoothed) image. This relates it to the concept of a low-pass filter.
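The smoothing effect can be observed on a tiny image. The following is a sketch under our own assumptions: the \(4\times 4\) test image is illustrative, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np

A = np.array([[0.0, 0.0, 0.0, 0.0],
              [0.0, 8.0, 8.0, 0.0],
              [0.0, 8.0, 8.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])     # illustrative image: bright square
W = np.full((2, 2), 0.25)                # 2x2 moving-average kernel

# Gather all 2x2 windows, then take the weighted sum over each window.
windows = np.lib.stride_tricks.sliding_window_view(A, W.shape)  # shape (3, 3, 2, 2)
blurred = np.einsum("ijpq,pq->ij", windows, W)
print(blurred)
```

The abrupt jump from 0 to 8 at the border of the square becomes a gradual ramp through the intermediate values 2 and 4, which is the attenuation of sharp transitions described above.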

    Problem:
    [Blurring Effect of a Mean Filter]

    Consider a \(3\times3\) mean filter (or box filter) kernel

    \[ W = \frac{1}{9} \begin{bmatrix} 1 & 1 & 1\\ 1 & 1 & 1\\ 1 & 1 & 1 \end{bmatrix}. \]

    Show that the operation produces a blurred image (equivalently, show that the kernel \(W\) is a low-pass filter).

    Hint:
  • Write down the expression for \((W * X)_{i,j}\), where \(X = [x_{p,q}]\) is a two-dimensional input signal (or image), and show that each output pixel \(a_{i,j} = (W * X)_{i,j}\) is the average of the \(3\times3\) neighborhood of \(X\) centered at \((i,j)\).
  • Explain why this operation produces a blurred image by giving an argument similar to the one given in the above example.
    Example:
    [Gaussian Blurring]

    A smoother and more natural way to blur an image is to use a Gaussian blur kernel, whose entries are proportional to a discrete approximation of a two-dimensional Gaussian function. A commonly used \(3\times3\) Gaussian blur kernel is

    \[ W_{\text{G}} = \frac{1}{16} \begin{bmatrix} 1 & 2 & 1\\ 2 & 4 & 2\\ 1 & 2 & 1 \end{bmatrix}. \]

    Convolving an image \(X\) with \(W_{\text{G}}\) produces the output

    \[ (W_{\text{G}} * X)_{i,j} = \frac{1}{16} \sum_{p=-1}^{1}\sum_{q=-1}^{1} c_{p,q}\, x_{i+p,\,j+q}, \]

    where \(c_{p,q}\) are the corresponding integer weights shown above.

    The weights are largest at the center and decrease symmetrically outward, so this filter performs a weighted average that preserves overall image brightness while reducing noise and small-scale variations. Unlike the simple mean filter, the Gaussian kernel avoids sharp artifacts and produces a more visually natural blurred image.
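Some of these properties of \(W_{\text{G}}\) can be verified numerically. This is a minimal sketch; the test patch with a single bright pixel is an illustrative assumption.

```python
import numpy as np

W_G = np.array([[1, 2, 1],
                [2, 4, 2],
                [1, 2, 1]]) / 16.0

# The weights sum to one, so a constant image is left unchanged
# (overall brightness is preserved).
print(W_G.sum())

# The kernel factors as an outer product of the 1D filter [1, 2, 1] / 4.
v = np.array([1.0, 2.0, 1.0]) / 4.0
is_separable = np.allclose(W_G, np.outer(v, v))
print(is_separable)

# Response at the centre of a 3x3 patch containing a single bright pixel.
patch = np.zeros((3, 3))
patch[1, 1] = 16.0
gauss_response = np.sum(W_G * patch)                      # centre weight 4/16
mean_response = np.sum(np.full((3, 3), 1.0 / 9.0) * patch)  # centre weight 1/9
print(gauss_response, mean_response)
```

The Gaussian kernel gives the centre pixel more weight than the mean filter does, so the output stays closer to the original pixel value while still averaging over the neighborhood.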

    The mean filter smooths (blurs) an image by averaging nearby pixels, whereas an edge detection filter highlights regions of sharp intensity change as illustrated in the following example.

    Example:
    [Edge Detection via Convolution]

    A simple vertical edge detection kernel is

    \[ W_v = \begin{bmatrix} -1 & 0 & 1\\ -1 & 0 & 1\\ -1 & 0 & 1 \end{bmatrix}, \]

    and a corresponding horizontal edge detection kernel is

    \[ W_h = \begin{bmatrix} -1 & -1 & -1\\ 0 & 0 & 0\\ 1 & 1 & 1 \end{bmatrix}. \]

    For an input image \(X = [x_{i,j}]\), the feature maps are computed as

    \[ A_v = W_v * X, \qquad A_h = W_h * X. \]

    Each of these responds strongly where there is a vertical or horizontal edge, respectively.

    The matrices \(W_v\) and \(W_h\) are the Prewitt filters. The closely related Sobel filters weight the middle row (for \(W_v\)) or the middle column (for \(W_h\)) by a factor of \(2\).

    The overall edge magnitude can be approximated by combining both directions:

    \[ A_{\text{edge}} = \sqrt{A_v^2 + A_h^2}, \]

    where the squares and the square root are taken entrywise.
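The two kernels can be applied to a small test image. The sketch below is illustrative: the helper name `correlate2d_valid` and the \(5\times5\) image are our own assumptions, only fully covered window positions are evaluated, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np

def correlate2d_valid(W, X):
    """out[i, j] = sum_{p,q} W[p, q] * X[i + p, j + q] over fully covered positions."""
    r, s = W.shape
    windows = np.lib.stride_tricks.sliding_window_view(X, (r, s))
    return np.einsum("ijpq,pq->ij", windows, W)

W_v = np.array([[-1, 0, 1]] * 3, dtype=float)   # vertical-edge kernel from the text
W_h = W_v.T                                      # horizontal-edge kernel from the text

# Illustrative 5x5 image: dark left part, bright right part (one vertical edge).
X = np.zeros((5, 5))
X[:, 3:] = 1.0

A_v = correlate2d_valid(W_v, X)
A_h = correlate2d_valid(W_h, X)
A_edge = np.sqrt(A_v**2 + A_h**2)                # entrywise magnitude
print(A_v)
print(A_h)
```

Only the vertical-edge kernel responds here, because the image does not change along the vertical direction.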

    2D Convolution Layer

    Let us restrict to grayscale images, where the input and the kernel are two-dimensional arrays (matrices).

    Let the input to the layer be a two-dimensional array

    \[ X = [x_{i,j}]_{i=1,j=1}^{n_{0}^{(h)},n_{0}^{(w)}} \in \mathbb{R}^{n_{0}^{(h)}\times n_{0}^{(w)}}, \]

    where \(n_{0}^{(h)}\) and \(n_{0}^{(w)}\) denote the height and width of the input.

    Let the convolutional kernel (or filter) be

    \[ W = [w_{p,q}]_{p=1,q=1}^{r^{(h)}_1, r^{(w)}_1} \in \mathbb{R}^{r^{(h)}_1\times r^{(w)}_1}, \]

    where \(r^{(h)}_1\) and \(r^{(w)}_1\) denote the kernel height and width, respectively. Let \(b \in \mathbb{R}\) be the bias term and \(\mathscr{A}\) be a nonlinear activation function.

    Definition:
    [Two-Dimensional Convolutional Layer]

    For given stride values \(s_h, s_w \ge 1\) in the vertical and horizontal directions, respectively, the convolutional layer produces the output

    \[ a_{i,j} = \mathscr{A}\!\Bigg( \sum_{p=1}^{r^{(h)}_1}\sum_{q=1}^{r^{(w)}_1} w_{p,q}\, x_{(i-1)s_h + p,\,(j-1)s_w + q} - b \Bigg), \quad i=1,2,\ldots,n_{\text{out}}^{(h)}, \; j=1,2,\ldots,n_{\text{out}}^{(w)}, \]

    where

    \[ n_{\text{out}}^{(h)} = \Big\lfloor \frac{n_{0}^{(h)} - r^{(h)}_1}{s_h} \Big\rfloor + 1, \qquad n_{\text{out}}^{(w)} = \Big\lfloor \frac{n_{0}^{(w)} - r^{(w)}_1}{s_w} \Big\rfloor + 1, \]

    are the output dimensions. The pair \((s_h, s_w)\) is called the stride.
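A direct, loop-based sketch of this definition is given below. The function name `conv2d_layer`, the identity activation, and the sample input are illustrative assumptions; the code uses 0-based indices, so the window for output \((i,j)\) starts at \((i\,s_h,\, j\,s_w)\).

```python
import numpy as np

def conv2d_layer(X, W, b=0.0, stride=(1, 1), activation=np.tanh):
    """2D convolutional layer without padding (0-based version of the definition)."""
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    s_h, s_w = stride
    r_h, r_w = W.shape
    n_h = (X.shape[0] - r_h) // s_h + 1
    n_w = (X.shape[1] - r_w) // s_w + 1
    pre = np.empty((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            patch = X[i * s_h : i * s_h + r_h, j * s_w : j * s_w + r_w]
            pre[i, j] = np.sum(W * patch) - b
    return activation(pre)

X = np.arange(25.0).reshape(5, 5)          # illustrative 5x5 input, X[i, j] = 5i + j
W = np.array([[1.0, 0.0], [0.0, -1.0]])    # illustrative 2x2 kernel
identity = lambda z: z
a1 = conv2d_layer(X, W, b=1.0, stride=(1, 1), activation=identity)
a2 = conv2d_layer(X, W, b=1.0, stride=(2, 2), activation=identity)
print(a1.shape, a2.shape)                  # (4, 4) and (2, 2)
```

Practical libraries implement the same operation far more efficiently; the explicit loops are kept here only to mirror the formula.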

    Remark:
    [Padding]

    If zero-padding of size \(p_h\) and \(p_w\) is applied to the height and width directions, the effective input dimensions become

    \[ H' = n_{0}^{(h)} + 2p_h, \qquad W' = n_{0}^{(w)} + 2p_w, \]

    and the output dimensions are given by

    \[ n_{\text{out}}^{(h)} = \Big\lfloor \frac{H' - r^{(h)}_1}{s_h} \Big\rfloor + 1, \qquad n_{\text{out}}^{(w)} = \Big\lfloor \frac{W' - r^{(w)}_1}{s_w} \Big\rfloor + 1. \]
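The output-size formula is easy to tabulate. In the sketch below, the helper name `conv_output_size` and the sizes used are illustrative assumptions; `np.pad` with its default constant mode realises the zero-padding.

```python
import numpy as np

def conv_output_size(n, r, s=1, p=0):
    """floor((n + 2p - r) / s) + 1 for one spatial direction."""
    return (n + 2 * p - r) // s + 1

# Illustrative 28x28 input with a 3x3 kernel.
print(conv_output_size(28, 3, s=1, p=0))   # 26: the output shrinks
print(conv_output_size(28, 3, s=1, p=1))   # 28: padding 1 keeps the size
print(conv_output_size(28, 3, s=2, p=1))   # 14: stride 2 halves it

# Pad widths are given per axis as (before, after) pairs.
X = np.ones((4, 6))
X_pad = np.pad(X, ((1, 1), (2, 2)))        # p_h = 1, p_w = 2
print(X_pad.shape)                         # (6, 10) = (4 + 2*1, 6 + 2*2)
```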

    Example:
    [2D Convolution with \(3\times3\) Kernel and Stride \((1,1)\)]

    Consider the input image and the kernel, respectively,

    \begin{eqnarray} \boldsymbol{X} = \begin{bmatrix} x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} \\ x_{2,1} & x_{2,2} & x_{2,3} & x_{2,4} \\ x_{3,1} & x_{3,2} & x_{3,3} & x_{3,4} \\ x_{4,1} & x_{4,2} & x_{4,3} & x_{4,4} \end{bmatrix}, \qquad \boldsymbol{W} = \begin{bmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \\ w_{3,1} & w_{3,2} & w_{3,3} \end{bmatrix}. \end{eqnarray}
    (7.23)

    The output activation at \((i,j)\) is

    \begin{eqnarray} a_{i,j} = \mathscr{A}\!\Big( \sum_{p=1}^{3}\sum_{q=1}^{3} w_{p,q}\, x_{i+p-1,\,j+q-1} - b \Big), \end{eqnarray}
    (7.24)

    for \(i,j=1,2\).

    Remark:
    [Pooling in Two Dimensions]

    A two-dimensional max-pooling layer with window size \((r^{(h)}_1, r^{(w)}_1)\) and stride \((s_h, s_w)\) acts on the feature map \(A = [a_{i,j}]\) as

    \[ a^{(\text{pool})}_{i,j} = \max_{(p,q)\in\mathcal{N}_{i,j}} a_{p,q}, \]

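    where \(\mathcal{N}_{i,j}\) denotes the set of indices covered by the pooling window associated with the output position \((i,j)\), that is, by analogy with the one-dimensional case, the pairs \((p,q)\) with \((i-1)s_h + 1 \le p \le (i-1)s_h + r^{(h)}_1\) and \((j-1)s_w + 1 \le q \le (j-1)s_w + r^{(w)}_1\). This reduces the spatial resolution while improving translational invariance.

    Min-pooling and average-pooling layers can be defined similarly.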
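The three pooling variants can be sketched with one function. The function name `pool_2d`, its `reduce` argument, and the sample feature map are illustrative assumptions; the code uses 0-based indices, so the window for output \((i,j)\) starts at \((i\,s_h,\, j\,s_w)\).

```python
import numpy as np

def pool_2d(A, window=(2, 2), stride=(2, 2), reduce=np.max):
    """Apply `reduce` over windows of A starting at (i*s_h, j*s_w)."""
    A = np.asarray(A, dtype=float)
    r_h, r_w = window
    s_h, s_w = stride
    n_h = (A.shape[0] - r_h) // s_h + 1
    n_w = (A.shape[1] - r_w) // s_w + 1
    out = np.empty((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            out[i, j] = reduce(A[i * s_h : i * s_h + r_h, j * s_w : j * s_w + r_w])
    return out

A = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 1, 7, 2],
              [3, 2, 1, 6]])
max_pooled = pool_2d(A)                      # max pooling
avg_pooled = pool_2d(A, reduce=np.mean)      # average pooling
min_pooled = pool_2d(A, reduce=np.min)       # min pooling
print(max_pooled, avg_pooled, min_pooled)
```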

    Recurrent Neural Networks

    Refer to Chapter 17 (page 543) in the following book:

    Calin, Ovidiu, Deep Learning Architectures: A Mathematical Approach, Springer, 2020.


    Transformers

    Refer to the following book, Chapter 12 on page 357:

    Bishop, Christopher M. and Bishop, Hugh, Deep Learning: Foundations and Concepts, Springer, 2024.
