Special Architectures
So far, we have studied multi-layer perceptrons (MLPs), which are fully connected (dense) feedforward neural networks. We have also seen that these networks perform well for a variety of pattern recognition problems that can be viewed as function approximation tasks, consistent with their property as universal approximators. We also extended our study to some recent topics, exploring their training dynamics through the Neural Tangent Kernel (NTK) and their role in scientific computing through Physics-Informed Neural Networks (PINNs).
In this chapter, we focus on some special architectures that are designed for specific types of data or applications, such as convolutional neural networks (CNNs), which are particularly suitable for image recognition, and recurrent neural networks (RNNs), which are suitable for language and sequence modelling. We conclude the chapter with a discussion of a recent development called Transformers, an attention-based architecture that has revolutionized sequence modelling. Transformers provide a unified framework for learning from text, images, and beyond.
Convolutional Neural Networks
The dense structure of an MLP treats each input component independently and ignores the spatial structure of the input data. For example, in image data, nearby pixels are highly correlated, and certain local features such as edges or corners can appear at different locations in the image. Full connectivity also leads to a very large number of parameters, so MLPs scale poorly to high-resolution images. Convolutional neural networks (CNNs) overcome these limitations by incorporating the local spatial structure of the data through local connectivity and weight sharing.
In local connectivity, each neuron is connected only to a small spatial region (patch) of the input in order to capture local patterns. Through weight sharing, the same set of weights, referred to as a filter or kernel, is used across different spatial locations. This reduces the number of parameters and allows the same feature to be detected regardless of where it appears in the input. Pooling layers are often added to reduce spatial dimensions while retaining the most important features. In this section, we briefly outline CNNs.
Networks of One-Dimensional Signals
In many applications, such as time-series analysis, speech recognition, or any sequence-based data, the input is naturally one-dimensional. For such data, one-dimensional convolutional neural networks (1D CNNs) provide an efficient architecture that captures local dependencies along the temporal or sequential axis.
Let \(f, g : \mathbb{R} \to \mathbb{R}\). The continuous one-dimensional convolution of \(f\) and \(g\), denoted by \((f * g)\), is defined as
\[ (f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau, \]
whenever the integral exists.
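Analogously, in the discrete case where the signals are sequences, say, \(f,g:\mathbb{Z}\rightarrow \mathbb{R}\) given by
\[ f = [\ldots, f_{-1}, f_0, f_1, \ldots], \qquad g = [\ldots, g_{-1}, g_0, g_1, \ldots], \]
the discrete convolution is defined by
\[ (f * g)_n = \sum_{k=-\infty}^{\infty} f_k\, g_{n-k}, \qquad n \in \mathbb{Z}, \]
whenever the series converges.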
In practice, we have signals of finite length. In this case, we extend the signal by zero and refer to the result as a signal with finite support. For instance, if we have \(\boldsymbol{g} = [g_0, g_1, \ldots, g_{N-1}]\) for some \(N>0\), then it should be understood as extended by zero on either side, giving the signal of infinite length \(g=[\ldots, 0, \boldsymbol{g}, 0, \ldots]\).
In CNNs, it is customary to use the cross-correlation operation (instead of strict convolution), given by
\[ y_i = \sum_{k=0}^{r-1} w_k\, g_{i+k}, \qquad i = 0, 1, \ldots, N-r, \]
where \(\boldsymbol{w} = [w_0, w_1, \ldots, w_{r-1}]\) is the kernel or filter of size \(r\) and \(\boldsymbol{g}=[g_0, g_1, \ldots, g_{N-1}]\) is finite input signal with \(0 < r \le N\).
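The difference between strict convolution and the cross-correlation used in CNNs can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `cross_correlate_valid` and the signal and kernel values are illustrative choices, not taken from the text.

```python
import numpy as np

def cross_correlate_valid(w, g):
    """Slide the kernel w over g without flipping: y[i] = sum_k w[k] * g[i + k]."""
    w = np.asarray(w, dtype=float)
    g = np.asarray(g, dtype=float)
    r, N = len(w), len(g)
    # Only positions where the kernel fits entirely inside the signal.
    return np.array([np.dot(w, g[i:i + r]) for i in range(N - r + 1)])

g = [1.0, 2.0, 4.0, 7.0]   # illustrative finite signal
w = [1.0, 0.0, -1.0]       # illustrative kernel of size r = 3

corr = cross_correlate_valid(w, g)       # no kernel flip
conv = np.convolve(g, w, mode="valid")   # strict convolution flips the kernel

print(corr)  # [-3. -5.]
print(conv)  # [3. 5.]
```

Cross-correlating with the reversed kernel `w[::-1]` reproduces `conv`, which is why the two operations differ only by a flip of the kernel. Since the kernel weights are learned, the flip makes no practical difference in a CNN.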
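Let \( \boldsymbol{x} = [x_1,\, x_2,\, \ldots,\, x_N] \) be a discrete-time signal (or sequence). A moving average with window size \(r\) computes each output value as the average of \(r\) consecutive inputs, given by
\[ y_i = \frac{1}{r} \sum_{k=0}^{r-1} x_{i+k}, \qquad i = 1, 2, \ldots, N-r+1. \]
This can be viewed as a cross-correlation (or convolution without kernel flipping) between the input \(\boldsymbol{x}\) and the kernel
\[ \boldsymbol{w} = \left[\tfrac{1}{r},\, \tfrac{1}{r},\, \ldots,\, \tfrac{1}{r}\right] \in \mathbb{R}^{r}, \]
that is,
\[ y_i = \sum_{k=0}^{r-1} w_k\, x_{i+k}. \]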
Thus, the moving average is a special case of convolution where all kernel weights are equal and sum to one.
The moving average smooths the signal by reducing local fluctuations and it acts as a low-pass filter, retaining slow variations in \(\boldsymbol{x}\) and attenuating rapid changes.
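The smoothing effect can be seen on a small signal with one outlier. This is a minimal sketch, assuming NumPy; the signal values and the window size \(r = 3\) are illustrative.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0, 8.0, 6.0])  # illustrative signal with a spike
r = 3
w = np.full(r, 1.0 / r)          # equal weights that sum to one

# np.correlate slides w over x without flipping it (cross-correlation).
y = np.correlate(x, w, mode="valid")

print(y)  # the spike of height 100 is spread out and reduced
```

The output has length \(N - r + 1 = 5\), and its largest entry is well below the spike in the input, which is the low-pass behaviour described above.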
The one-dimensional moving average filter discussed above can be viewed as a low-pass filter, since it smooths rapid variations in the input signal. In two dimensions, the same idea extends naturally to image processing, where a smoothing kernel corresponds to blurring an image.
In contrast, edge detection in images corresponds to emphasizing changes (differences) in the signal, i.e., high-frequency components. In one dimension, this can be seen directly through convolution with a difference kernel or high-pass filter.
Consider a one-dimensional input signal \( \boldsymbol{x} = [x_1,\, x_2,\, \ldots,\, x_N] \) and a difference kernel (or edge detection kernel)
The convolution (or, in CNNs, the cross-correlation) of \(\boldsymbol{x}\) and \(\boldsymbol{w}\) produces the output
Hence,
This output measures the change (or discrete gradient) between successive input values. Regions where \(\boldsymbol{x}\) is approximately constant produce small responses, while sharp transitions yield large responses. Hence, large responses correspond to edges in one-dimensional data.
In two dimensions, similar difference kernels (e.g., Sobel or Prewitt filters) are used to detect edges in images.
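A one-dimensional step signal makes the behaviour of a difference kernel concrete. The sketch below assumes NumPy and uses the kernel \([-1, 1]\) under cross-correlation, so that each output is \(x_{i+1} - x_i\); this particular kernel and the signal values are assumptions made for illustration.

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])  # illustrative step ("edge") signal
w = np.array([-1.0, 1.0])                     # assumed difference kernel

# Cross-correlation: y[i] = -x[i] + x[i + 1]
y = np.correlate(x, w, mode="valid")

print(y)  # [0. 0. 4. 0. 0.]
```

The response is zero on the flat parts and nonzero only at the jump, which is the one-dimensional analogue of an edge.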
One-Dimensional Convolutional Layer
Let the input to a 1D convolutional layer be a sequence (or signal)
and let the convolutional kernel (or filter) be
where the integer \(1\le r\le N\) is called the kernel size or receptive field length.
For a given integer \(s \ge 1\), a convolutional layer with activation function \(\mathscr{A}\), convolution kernel \(\boldsymbol{w}\) and bias \(b\) produces the output
where
The number \(s\) is called the stride, which specifies the step size with which the kernel moves along the input.
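The role of the stride can be illustrated with a small implementation. This is a minimal sketch under stated assumptions: it uses zero-based array indexing, the sign convention \(\boldsymbol{w}\cdot\boldsymbol{x} - b\) for the bias, ReLU as the activation, and the hypothetical function name `conv1d_layer`.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def conv1d_layer(x, w, b=0.0, stride=1, activation=relu):
    """Apply one kernel w with bias b along x, moving `stride` steps at a time."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    N, r = len(x), len(w)
    n_out = (N - r) // stride + 1          # number of positions where the kernel fits
    pre = np.array([np.dot(w, x[i * stride:i * stride + r]) - b
                    for i in range(n_out)])
    return activation(pre)

x = [1.0, 3.0, 2.0, 6.0, 4.0]   # illustrative input
w = [-1.0, 1.0]                 # illustrative kernel of size r = 2

z1 = conv1d_layer(x, w, b=1.0, stride=1)
z2 = conv1d_layer(x, w, b=1.0, stride=2)

print(z1)  # [1. 0. 3. 0.]
print(z2)  # [1. 3.]
```

With stride 2 the kernel visits every second position, so the output is roughly half as long as with stride 1.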
The convolution layer replaces the affine transformation \(\boldsymbol{w}\cdot\boldsymbol{x} - b\) of a neuron in an MLP by the local convolution operation \(\boldsymbol{w} * \boldsymbol{x} - b\). Unlike an MLP layer, a convolution layer is not organized neuron-wise, since each output activation (feature map entry) depends only on a local receptive field of the input rather than the entire input vector. Nevertheless, the convolution operation can be equivalently expressed as a matrix-vector multiplication, where the kernel \(\boldsymbol{w}\) is expanded into a structured (sparse Toeplitz) matrix. Thus, a convolutional layer can be viewed as a special kind of fully connected layer.
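The matrix view can be checked numerically. The following sketch, assuming NumPy, builds the banded matrix whose rows contain shifted copies of the kernel and compares the matrix-vector product with the sliding-window computation; the helper name `conv_matrix` and the data are illustrative.

```python
import numpy as np

def conv_matrix(w, N, stride=1):
    """Matrix T such that T @ x equals the cross-correlation of x with w."""
    w = np.asarray(w, dtype=float)
    r = len(w)
    n_out = (N - r) // stride + 1
    T = np.zeros((n_out, N))
    for i in range(n_out):
        # Row i holds the kernel, shifted to the window it acts on.
        T[i, i * stride:i * stride + r] = w
    return T

x = np.array([1.0, 3.0, 2.0, 6.0, 4.0])
w = np.array([-1.0, 1.0])

T = conv_matrix(w, len(x))
dense = T @ x                                # dense-layer style computation
sliding = np.correlate(x, w, mode="valid")   # sliding-window computation

print(T)
print(dense, sliding)
```

For stride 1 the matrix is Toeplitz: every diagonal is constant. Only \(r\) entries per row are nonzero, and all rows share the same \(r\) weights, which is the weight sharing described earlier.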
Show that a one-dimensional convolutional layer can be expressed as a fully connected (dense) layer with a specially structured weight (sparse Toeplitz) matrix.
Consider an input sequence
and a kernel (filter)
with kernel size \(r = 2\), stride \(s = 1\), and bias \(b\).
According to the convolutional layer definition,
where
Thus, the layer output activations are
Hence, the output feature map (activation vector) of the convolution layer is
The number of zeros added to both ends of the input sequence before performing convolution is called padding. Padding ensures that edge elements receive equal treatment and can preserve the original input length. If we pad with \(p\) zeros on each side, the effective input length becomes \(N + 2p\).
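The effect of padding on the output length can be sketched as follows. The length formula \(\lfloor (N + 2p - r)/s \rfloor + 1\) used here is the usual one for this setting and is stated as an assumption; the helper name `conv_output_length` and the data are illustrative.

```python
import numpy as np

def conv_output_length(N, r, s=1, p=0):
    """Number of kernel positions for input length N, kernel size r, stride s, padding p."""
    return (N + 2 * p - r) // s + 1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 1.0, 1.0])
p = 1

x_padded = np.pad(x, p)                         # p zeros on each side
y = np.correlate(x_padded, w, mode="valid")

print(x_padded)  # [0. 1. 2. 3. 4. 5. 0.]
print(y)         # [ 3.  6.  9. 12.  9.]
print(conv_output_length(len(x), len(w), s=1, p=p))  # 5
```

With \(r = 3\), \(s = 1\) and \(p = 1\) the output has the same length as the input, and the first and last inputs now each take part in two windows instead of one.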
Given an input of length \(N\), a kernel of size \(r\), a stride \(s\), and padding \(p\), the convolutional layer is given by
where
After a convolutional layer produces its output (called the feature map), it is often followed by a pooling layer. Given an input feature map
a max-pooling layer with pooling window of size \(p\) and stride \(s_p\) produces the output
where \(\mathcal{N}(i)\) denotes a small neighborhood of indices around \(i\), typically
and
If the maximum operation is replaced by the average, we obtain the average pooling layer. Similarly, if the maximum operation is replaced by the minimum operation, then we obtain the min-pooling layer.
Here, \(\boldsymbol{z}\) denotes the output feature map obtained from the preceding convolutional layer.
The pooling operation acts on \(\boldsymbol{z}\) to produce a lower-dimensional representation
\(\boldsymbol{z}^{(\text{pool})}\), which can be fed to the next convolutional or fully connected layer.
Schematically, we have
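Thus, pooling serves two main purposes, namely, reducing the dimension of the feature map and increasing robustness to small translations of the input.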
Let the convolution output be
and consider max pooling with pooling size \(p=2\) and stride \(s_p=2\). Then,
Hence, the pooled output is
which has reduced length compared to the original feature map.
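One-dimensional pooling can be sketched in a few lines. This is a minimal sketch assuming NumPy and non-centered windows that start at multiples of the stride; the function name `pool1d` and the feature map values are illustrative and are not the values of the example above.

```python
import numpy as np

def pool1d(z, size, stride, reduce=np.max):
    """Apply `reduce` (max, mean, or min) over windows of length `size`."""
    z = np.asarray(z, dtype=float)
    n_out = (len(z) - size) // stride + 1
    return np.array([reduce(z[i * stride:i * stride + size])
                     for i in range(n_out)])

z = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0]   # illustrative feature map

z_max = pool1d(z, size=2, stride=2)                   # max pooling
z_avg = pool1d(z, size=2, stride=2, reduce=np.mean)   # average pooling
z_min = pool1d(z, size=2, stride=2, reduce=np.min)    # min pooling

print(z_max)  # [3. 5. 4.]
print(z_avg)  # [2.  3.5 2. ]
print(z_min)  # [1. 2. 0.]
```

In each case the six-entry feature map is reduced to three entries, one per window.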
Let us now illustrate translational invariance through pooling.
Applying max pooling with window size \(2\) and stride \(2\) gives
If the input is slightly shifted, producing
then the pooled output remains
Hence, pooling increases translational invariance, meaning that small shifts in the input signal do not significantly affect the pooled output.
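The invariance is only approximate, which a small experiment makes visible. The sketch below assumes NumPy and uses illustrative feature maps containing a single peak: a shift that keeps the peak inside its pooling window leaves the output unchanged, while a shift that moves it into the neighbouring window does not.

```python
import numpy as np

def max_pool1d(z, size=2, stride=2):
    z = np.asarray(z, dtype=float)
    n_out = (len(z) - size) // stride + 1
    return np.array([z[i * stride:i * stride + size].max()
                     for i in range(n_out)])

z       = [0.0, 9.0, 0.0, 0.0, 0.0, 0.0]   # peak at index 1
z_left  = [9.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # peak shifted left, same window
z_right = [0.0, 0.0, 9.0, 0.0, 0.0, 0.0]   # peak shifted right, next window

pooled       = max_pool1d(z)
pooled_left  = max_pool1d(z_left)
pooled_right = max_pool1d(z_right)

print(pooled)        # [9. 0. 0.]
print(pooled_left)   # [9. 0. 0.]
print(pooled_right)  # [0. 9. 0.]
```

Pooling therefore absorbs shifts smaller than the window, but a shift across a window boundary still moves the pooled response by one position.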
Two-Dimensional Convolutional Layer
We now extend the one-dimensional convolutional layer to the two-dimensional case, which is the core operation in convolutional neural networks (CNNs) for image processing.
Convolution Operator
In this subsection, we briefly recall the basic definitions concerning convolution. We first start with convolution of two functions from \(\mathbb{R}^2\) to \(\mathbb{R}\).
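Let \(f, g : \mathbb{R}^2 \to \mathbb{R}\). The convolution of \(f\) and \(g\), denoted by \((f * g)\), is defined by
\[ (f * g)(x, y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(u, v)\, g(x - u,\, y - v)\, du\, dv, \]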
whenever the integral exists.
The continuous definition of the convolution operator can be restricted to the discrete case, leading to the definition of convolution between two matrices.
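Let \(A = [a_{i,j}] \in \mathbb{R}^{m \times n}\) and \(K = [k_{p,q}] \in \mathbb{R}^{r \times s}\). The discrete convolution of \(A\) and \(K\), denoted by \((K * A)\), is defined by
\[ (K * A)_{i,j} = \sum_{p}\sum_{q} k_{p,q}\, a_{i - p,\, j - q}, \]
with the sums taken over all row and column indices \((p, q)\) of \(K\),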
where \(a_{i - p,\, j - q} = 0\) whenever the indices fall outside the range of \(A\).
CNNs typically use the cross-correlation operation given by
with compact support, which means that there exist integers \(r \ge 1\) and \(c \ge 1\) such that \(w_{i,j}=0\) whenever \(|i| \ge r\) or \(|j| \ge c\). In this case, we say that \(W\) has an \(r\times c\) support.
Given a two dimensional discrete signal \(A\), the \(2\times 2\) supported kernel that produces a simple moving average is given by
where \(w_{0,0}=w_{0,1}=w_{1,0}=w_{1,1}=1/4\), say, and the matrix is extended by zeros in all four directions. Then the cross-correlation leads to the simple moving average
This operation replaces each pixel value by the average of its four neighboring pixels, thereby producing a blurred or smoothed version of the image.
Averaging replaces each pixel by the mean of its local neighborhood, which suppresses rapid spatial variations (high-frequency components) while preserving slowly varying content (low-frequency components). Consequently, sharp transitions (edges) and noise are attenuated, yielding a blurred (smoothed) image. This relates it to the concept of a low-pass filter.
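The blurring effect of the \(2\times 2\) averaging kernel can be seen on a small image with a sharp edge. This is a minimal sketch assuming NumPy; the helper name `cross_correlate2d`, the restriction to positions where the kernel fits inside the image, and the image itself are illustrative choices.

```python
import numpy as np

def cross_correlate2d(X, W):
    """out[i, j] = sum_{p, q} W[p, q] * X[i + p, j + q], kernel fully inside X."""
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    r_h, r_w = W.shape
    n_h, n_w = X.shape[0] - r_h + 1, X.shape[1] - r_w + 1
    out = np.empty((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            out[i, j] = np.sum(W * X[i:i + r_h, j:j + r_w])
    return out

X = np.zeros((4, 4))
X[:, 2:] = 8.0                  # left half dark, right half bright
W = np.full((2, 2), 0.25)       # 2x2 moving-average kernel

Y = cross_correlate2d(X, W)
print(Y)
```

Each row of `Y` is `[0, 4, 8]`: the jump from 0 to 8 in the input has been replaced by a gradual ramp, which is what blurring an edge means.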
Consider a \(3\times3\) mean filter (or box filter) kernel
Show that the operation produces a blurred image (equivalently, show that the kernel \(W\) is a low-pass filter).
A smoother and more natural way to blur an image is to use a Gaussian blur kernel, whose entries are proportional to a discrete approximation of a two-dimensional Gaussian function. A commonly used \(3\times3\) Gaussian blur kernel is
Convolving an image \(X\) with \(W_{\text{G}}\) produces the output
where \(c_{p,q}\) are the corresponding integer weights shown above.
The weights are largest at the center and decrease symmetrically outward, so this filter performs a weighted average that preserves overall image brightness while reducing noise and small-scale variations. Unlike the simple mean filter, the Gaussian kernel avoids sharp artifacts and produces a more visually natural blurred image.
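Two of the properties just mentioned can be checked directly. The sketch below assumes NumPy and assumes the widely used kernel \(\tfrac{1}{16}\begin{bmatrix}1&2&1\\2&4&2\\1&2&1\end{bmatrix}\), built as an outer product of \([1, 2, 1]/4\) with itself; the helper `cross_correlate2d` and the test images are illustrative.

```python
import numpy as np

def cross_correlate2d(X, W):
    X, W = np.asarray(X, dtype=float), np.asarray(W, dtype=float)
    r_h, r_w = W.shape
    n_h, n_w = X.shape[0] - r_h + 1, X.shape[1] - r_w + 1
    out = np.empty((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            out[i, j] = np.sum(W * X[i:i + r_h, j:j + r_w])
    return out

g1 = np.array([1.0, 2.0, 1.0]) / 4.0
W_G = np.outer(g1, g1)            # assumed 3x3 Gaussian blur kernel

flat = np.full((5, 5), 5.0)       # constant-brightness image
impulse = np.zeros((5, 5))
impulse[2, 2] = 1.0               # single bright pixel

flat_out = cross_correlate2d(flat, W_G)
impulse_out = cross_correlate2d(impulse, W_G)

print(W_G * 16)      # integer weights 1, 2, 4
print(flat_out)      # brightness 5.0 is preserved
print(impulse_out)   # the bright pixel is spread into the kernel shape
```

Because the weights sum to one, a constant image is left unchanged, and because the kernel is symmetric, convolution and cross-correlation give the same result here.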
The mean filter smooths (blurs) an image by averaging nearby pixels, whereas an edge detection filter highlights regions of sharp intensity change as illustrated in the following example.
A simple vertical edge detection kernel is
and a corresponding horizontal edge detection kernel is
For an input image \(X = [x_{i,j}]\), the feature maps are computed as
Each of these responds strongly where there is a vertical or horizontal edge, respectively.
The matrices \(W_v\) and \(W_h\) are called Sobel filters (or sometimes Prewitt filters, depending on scaling).
The overall edge magnitude can be approximated by combining both directions:
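The two directional responses and their combination can be sketched on a small image. The code assumes NumPy, assumes the standard Sobel kernels for \(W_v\) and \(W_h\), and assumes the Euclidean combination \(\sqrt{Z_v^2 + Z_h^2}\) as the edge magnitude; other scalings and combinations (for example the sum of absolute values) are also in use.

```python
import numpy as np

def cross_correlate2d(X, W):
    X, W = np.asarray(X, dtype=float), np.asarray(W, dtype=float)
    r_h, r_w = W.shape
    n_h, n_w = X.shape[0] - r_h + 1, X.shape[1] - r_w + 1
    out = np.empty((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            out[i, j] = np.sum(W * X[i:i + r_h, j:j + r_w])
    return out

# Assumed standard Sobel kernels.
W_v = np.array([[-1.0, 0.0, 1.0],
                [-2.0, 0.0, 2.0],
                [-1.0, 0.0, 1.0]])   # responds to vertical edges
W_h = W_v.T                          # responds to horizontal edges

X = np.zeros((5, 5))
X[:, 2:] = 1.0                       # image with one vertical edge

Z_v = cross_correlate2d(X, W_v)
Z_h = cross_correlate2d(X, W_h)
magnitude = np.hypot(Z_v, Z_h)       # sqrt(Z_v**2 + Z_h**2), assumed combination

print(Z_v)        # nonzero near the edge
print(Z_h)        # zero everywhere: no horizontal edge
print(magnitude)
```

Only the vertical-edge kernel responds to this image, so the magnitude coincides with \(|Z_v|\).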
2D Convolution Layer
Let us restrict to grayscale images, where the input and the kernel are two-dimensional arrays (matrices).
Let the input to the layer be a two-dimensional array
where \(n_{0}^{(h)}\) and \(n_{0}^{(w)}\) denote the height and width of the input.
Let the convolutional kernel (or filter) be
where \(r^{(h)}_1\) and \(r^{(w)}_1\) denote the kernel height and width, respectively. Let \(b \in \mathbb{R}\) be the bias term and \(\mathscr{A}\) be a nonlinear activation function.
For given stride values \(s_h, s_w \ge 1\) in the vertical and horizontal directions, respectively, the convolutional layer produces the output
where
are the output dimensions. The pair \((s_h, s_w)\) is called the stride.
If zero-padding of size \(p_h\) and \(p_w\) is applied to the height and width directions, the effective input dimensions become
and the output dimensions are given by
Consider the input image and the kernel, respectively,
The output activation at \((i,j)\) is
for \(i,j=1,2\).
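A two-dimensional layer with stride and zero-padding can be sketched as follows. This is a minimal sketch under stated assumptions: zero-based indexing, the bias subtracted as in \(\boldsymbol{w}\cdot\boldsymbol{x} - b\), ReLU as the activation, output size \(\lfloor (n + 2p - r)/s \rfloor + 1\) in each direction, and illustrative input and kernel values that are not those of the example above. The name `conv2d_layer` is hypothetical.

```python
import numpy as np

def conv2d_layer(X, W, b=0.0, stride=(1, 1), padding=(0, 0)):
    """Single-channel 2D convolutional layer (cross-correlation) with ReLU."""
    s_h, s_w = stride
    p_h, p_w = padding
    Xp = np.pad(np.asarray(X, dtype=float), ((p_h, p_h), (p_w, p_w)))
    W = np.asarray(W, dtype=float)
    r_h, r_w = W.shape
    n_h = (Xp.shape[0] - r_h) // s_h + 1     # output height
    n_w = (Xp.shape[1] - r_w) // s_w + 1     # output width
    Z = np.empty((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            patch = Xp[i * s_h:i * s_h + r_h, j * s_w:j * s_w + r_w]
            Z[i, j] = np.sum(W * patch) - b
    return np.maximum(Z, 0.0)                # ReLU activation

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
W = np.array([[-1.0, 0.0],
              [ 0.0, 1.0]])

A1 = conv2d_layer(X, W, b=1.0)                                  # 3x3 input, 2x2 kernel
A2 = conv2d_layer(X, W, b=1.0, stride=(2, 2), padding=(1, 1))   # padded, stride 2

print(A1)  # [[3. 3.] [3. 3.]]
print(A2)  # [[0. 2.] [6. 3.]]
```

Both calls produce a \(2\times 2\) output: the first because \(3 - 2 + 1 = 2\), the second because \(\lfloor (3 + 2 - 2)/2 \rfloor + 1 = 2\).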
A two-dimensional max-pooling layer with window size \((r^{(h)}_1, r^{(w)}_1)\) and stride \((s_h, s_w)\) acts on the feature map \(A = [a_{i,j}]\) as
where \(\mathcal{N}_{i,j}\) denotes the set of indices covered by the pooling window centered around \((i,j)\). This reduces the spatial resolution while improving translational invariance.
Min-pooling and average-pooling layers can be defined similarly.
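The two-dimensional pooling variants differ only in the reduction applied to each window. The sketch below assumes NumPy and windows that start at multiples of the stride (rather than centered windows); the function name `pool2d` and the feature map are illustrative.

```python
import numpy as np

def pool2d(A, size=(2, 2), stride=(2, 2), reduce=np.max):
    """Apply `reduce` over each pooling window of the feature map A."""
    A = np.asarray(A, dtype=float)
    r_h, r_w = size
    s_h, s_w = stride
    n_h = (A.shape[0] - r_h) // s_h + 1
    n_w = (A.shape[1] - r_w) // s_w + 1
    out = np.empty((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            window = A[i * s_h:i * s_h + r_h, j * s_w:j * s_w + r_w]
            out[i, j] = reduce(window)
    return out

A = np.arange(16, dtype=float).reshape(4, 4)   # illustrative 4x4 feature map

A_max = pool2d(A)                    # max pooling
A_avg = pool2d(A, reduce=np.mean)    # average pooling

print(A_max)  # [[ 5.  7.] [13. 15.]]
print(A_avg)  # [[ 2.5  4.5] [10.5 12.5]]
```

A \(2\times 2\) window with stride \((2, 2)\) halves both spatial dimensions, so the \(4\times 4\) map becomes \(2\times 2\).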
Recurrent Neural Networks
Refer to Chapter 17 (page 543) in the following book:
Calin, Ovidiu, Deep Learning Architectures: A Mathematical Approach, Springer, 2020.
Transformers
Refer to the following book, Chapter 12 on page 357:
Bishop, Christopher M. and Bishop, Hugh, Deep Learning: Foundations and Concepts, Springer, 2024.