# How to use a derivative lambda function in Python

For this assignment I will give 4 answers built from 2 different approaches: mathematical calculation and using a tool. Each approach is split in two again: direct input and an analytical solution.

1. Mathematical calculation

In the direct approach, I substitute x straight into the function, which produces an error because the result is undefined.

In the analytical approach, I factor the function, cancel the common factor between the numerator and the denominator, and then substitute x into the simplified function.

2. Calculation with an application

As with the mathematical approach, the direct calculation in the application simply substitutes x into the function without any analysis.

The analytical approach uses the limit function of the application I used and gives the expected output.
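The two approaches above can be sketched in plain Python. The function f(x) = (x**2 - 1)/(x - 1) and the point x = 1 are assumptions chosen for illustration; the notes do not say which function was actually used.

```python
# Hypothetical example: f(x) = (x**2 - 1) / (x - 1), evaluated at x = 1.
# The function is an assumption; the notes do not name the actual one.

def f_direct(x):
    """Direct substitution: fails at x = 1 because it computes 0/0."""
    return (x**2 - 1) / (x - 1)

def f_factored(x):
    """After factoring x**2 - 1 = (x - 1)(x + 1) and cancelling (x - 1)."""
    return x + 1

try:
    direct_result = f_direct(1)
except ZeroDivisionError:
    direct_result = None  # undefined: 0/0

analytic_limit = f_factored(1)

# Numerical check: approach x = 1 from both sides and average.
h = 1e-6
numeric_limit = (f_direct(1 - h) + f_direct(1 + h)) / 2

print(direct_result, analytic_limit, round(numeric_limit, 6))
```

Direct substitution raises `ZeroDivisionError`, while the factored form and the two-sided numerical estimate agree on the limit.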

## Second meeting (14 February 2020)

Studying numerical methods is likened to dipping your little finger into an ocean and lifting it out again, leaving a drop on the finger. That drop is a small bit of knowledge from a vast ocean of knowledge. The drop is water, not oil, so it merges back with the rest because it is the same substance. In the same way, numerical methods are connected to everyday life.
In mathematics, dividing a number by itself gives 1. This does not hold for 0/0: evaluating 0/0 produces an error because the result is undefined.

• Chapter 7: initial value problem

Data -> information -> knowledge -> beneficial knowledge -> wisdom -> guidance

1. A problem I find interesting

Reference: Kiusalaas, Jaan. 2013. Numerical Methods in Engineering with Python 3. USA: Cambridge University Press.

2. Numerical Methods assignment 2(1). For the problem I found interesting, I have found a numerical solution. The problem asks for the cross-sectional area of the bar (b) that connects 2 trusses. I also have to find the angle theta that minimizes the volume, subject to stress <= 150 MPa, displacement <= 5 mm, and E = 200E9. I used quite a few modules (following recommendations from the internet). First, numpy for numerical calculation such as array, cos, sin, and pi. Then the math module for mathematical operations. After that come the hand-written modules: bracket, search, powell, objective_fn, constraint_fns, and lagrangian_function. From the output, the first constraint is 50000 pascal, calculated with the formula P/(2 * sin(theta)), with a maximum stress of 150000000 pascal and a cross-sectional area (A) of 0.000333 m**2. The second constraint is 0.000001 pascal, calculated with the formula (P * b)/(2 * E * sin(2*theta) * sin(theta)), with a maximum stress of 0.005 pascal and an area of 0.000231 m**2. Next, the initial guess (provisional answer) for A is 0.000282 m**2. With a constraint lambda of 100000, the minimum point is 0.632 for C1 and 0.000282 for C2, and the function value is 0.0014. With a constraint lambda of 1000000, the minimum point is again C1 = 0.632 and C2 = 0.000282, with a function value of 0.0014.
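The area reported for the first constraint can be reproduced with one division. The sketch below takes the value 50000 for P/(2 * sin(theta)) and the 150 MPa limit from the output quoted above; the variable names are illustrative, and the quantity P/(2 * sin(theta)) is treated here as a force, which is an assumption about the units in the notes.

```python
# Values quoted in the assignment output above.
member_force = 50_000.0   # value of P / (2 * sin(theta)) reported by the solver
max_stress = 150e6        # allowable stress, 150 MPa expressed in Pa

# Stress = force / area, so the smallest admissible area is force / max_stress.
min_area = member_force / max_stress

print(f"{min_area:.6f} m**2")
```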


3. Numerical Methods assignment 2: initial value

The problem I worked on is an initial value problem for a first-order derivative. By differentiating the given function and using the known value of y, we can find the value of the constant c in the first-order solution.
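A minimal sketch of that procedure, assuming a hypothetical problem that is not the one from the assignment: the equation y' = -2y with general solution y(x) = c * exp(-2x) and the known value y(0) = 3.

```python
import math

# Hypothetical problem (not the one from the assignment):
#   y' = -2*y, general solution y(x) = c * exp(-2*x), known value y(0) = 3.

def general_solution(x, c):
    return c * math.exp(-2.0 * x)

# Use the known value to fix the constant: y(x0) = c * exp(-2*x0).
x0, y0 = 0.0, 3.0
c = y0 / math.exp(-2.0 * x0)

# Check by differentiating numerically: y'(x) should equal -2 * y(x).
x, h = 0.5, 1e-6
dy_dx = (general_solution(x + h, c) - general_solution(x - h, c)) / (2 * h)
residual = dy_dx + 2.0 * general_solution(x, c)

print(c, residual)
```

The residual should be close to zero, which confirms that the function with the fitted constant satisfies both the differential equation and the initial value.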

## Week 4 assignment

import numpy as np
from numpy.polynomial.polynomial import polyfit
import matplotlib.pyplot as plt

x = np.asarray([11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
y = np.asarray([618.230, 735.936, 863.896, 1002.142, 1150.638,
                1309.408, 1478.436, 1657.748, 1847.352, 2047.18])

b, m = polyfit(x, y, 1)      # intercept b and slope m (lowest degree first)
w = np.polyfit(x, y, 1)      # same fit, coefficients highest degree first
plt.plot(x, y, '.')          # data points
plt.plot(x, b + m * x, '-')  # fitted straight line
plt.show()
print(w)

As we design increasingly deeper networks it becomes imperative to understand how adding layers can increase the complexity and expressiveness of the network. Even more important is the ability to design networks where adding layers makes networks strictly more expressive rather than just different. To make some progress we need a bit of mathematics.

## 8.6.1. Function Classes

Consider $$\mathcal{F}$$, the class of functions that a specific network architecture (together with learning rates and other hyperparameter settings) can reach. That is, for all $$f \in \mathcal{F}$$ there exists some set of parameters (e.g., weights and biases) that can be obtained through training on a suitable dataset. Let’s assume that $$f^*$$ is the “truth” function that we really would like to find. If it is in $$\mathcal{F}$$, we are in good shape but typically we will not be quite so lucky. Instead, we will try to find some $$f^*_\mathcal{F}$$ which is our best bet within $$\mathcal{F}$$. For instance, given a dataset with features $$\mathbf{X}$$ and labels $$\mathbf{y}$$, we might try finding it by solving the following optimization problem:

$f^*_\mathcal{F} \stackrel{\mathrm{def}}{=} \mathop{\mathrm{argmin}}_f L(\mathbf{X}, \mathbf{y}, f) \text{ subject to } f \in \mathcal{F}.$ (8.6.1)

We know that regularization may control complexity of $$\mathcal{F}$$ and achieve consistency, so a larger size of training data generally leads to better $$f^*_\mathcal{F}$$. It is only reasonable to assume that if we design a different and more powerful architecture $$\mathcal{F}'$$ we should arrive at a better outcome. In other words, we would expect that $$f^*_{\mathcal{F}'}$$ is “better” than $$f^*_{\mathcal{F}}$$. However, if $$\mathcal{F} \not\subseteq \mathcal{F}'$$ there is no guarantee that this should even happen. In fact, $$f^*_{\mathcal{F}'}$$ might well be worse. As illustrated by Fig. 8.6.1, for non-nested function classes, a larger function class does not always move closer to the “truth” function $$f^*$$. For instance, on the left of Fig. 8.6.1, though $$\mathcal{F}_3$$ is closer to $$f^*$$ than $$\mathcal{F}_1$$, $$\mathcal{F}_6$$ moves away and there is no guarantee that further increasing the complexity can reduce the distance from $$f^*$$. With nested function classes where $$\mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_6$$, on the right of Fig. 8.6.1, we can avoid the aforementioned issue from the non-nested function classes.

Fig. 8.6.1 For non-nested function classes, a larger (indicated by area) function class does not guarantee to get closer to the “truth” function ($$f^*$$). This does not happen in nested function classes.

Thus, only if larger function classes contain the smaller ones are we guaranteed that increasing them strictly increases the expressive power of the network. For deep neural networks, if we can train the newly-added layer into an identity function $$f(\mathbf{x}) = \mathbf{x}$$, the new model will be as effective as the original model. As the new model may get a better solution to fit the training dataset, the added layer might make it easier to reduce training errors.
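The effect of nesting can be checked on a toy problem. The sketch below uses two hypothetical function classes that are not from the text: constant functions ($$\mathcal{F}_1$$) and linear functions ($$\mathcal{F}_2$$), so that $$\mathcal{F}_1 \subseteq \mathcal{F}_2$$. Because every constant is also a line with zero slope, the best training loss in the larger class cannot exceed the best training loss in the smaller one. The data values are made up.

```python
# Toy data; the values are made up for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.9, 5.2, 6.8, 9.1]

def mse(predict):
    """Mean squared error of a prediction function on the toy data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Best function in F1 (constants): the mean of the labels minimizes the MSE.
mean_y = sum(ys) / len(ys)
loss_f1 = mse(lambda x: mean_y)

# Best function in F2 (lines): closed-form least squares for slope and intercept.
mean_x = sum(xs) / len(xs)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
loss_f2 = mse(lambda x: slope * x + intercept)

print(loss_f1, loss_f2)
```

This only illustrates the training-loss side of the argument; it says nothing about how well either class generalizes.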

This is the question that He et al. (2016) considered when working on very deep computer vision models. At the heart of their proposed residual network (ResNet) is the idea that every additional layer should more easily contain the identity function as one of its elements. These considerations are rather profound but they led to a surprisingly simple solution, a residual block. With it, ResNet won the ImageNet Large Scale Visual Recognition Challenge in 2015. The design had a profound influence on how to build deep neural networks. For instance, residual blocks have been added to recurrent networks. Likewise, Transformers use them to stack many layers of networks efficiently. They are also used in graph neural networks and, as a basic concept, have been used extensively in computer vision. Note that residual networks are predated by highway networks that share some of the motivation, albeit without the elegant parametrization around the identity function.

## 8.6.2. Residual Blocks

Let’s focus on a local part of a neural network, as depicted in Fig. 8.6.2. Denote the input by $$\mathbf{x}$$. We assume that the desired underlying mapping we want to obtain by learning is $$f(\mathbf{x})$$, to be used as input to the activation function on the top. On the left, the portion within the dotted-line box must directly learn the mapping $$f(\mathbf{x})$$. On the right, the portion within the dotted-line box needs to learn the residual mapping $$g(\mathbf{x}) = f(\mathbf{x}) - \mathbf{x}$$, which is how the residual block derives its name. If the identity mapping $$f(\mathbf{x}) = \mathbf{x}$$ is the desired underlying mapping, the residual mapping amounts to $$g(\mathbf{x}) = 0$$ and it is thus easier to learn: we only need to push the weights and biases of the upper weight layer (e.g., fully connected layer and convolutional layer) within the dotted-line box to zero. The right figure illustrates the residual block of ResNet, where the solid line carrying the layer input $$\mathbf{x}$$ to the addition operator is called a residual connection (or shortcut connection). With residual blocks, inputs can forward propagate faster through the residual connections across layers. In fact, the residual block can be thought of as a special case of the multi-branch Inception block: it has two branches, one of which is the identity mapping.

Fig. 8.6.2 In a regular block (left), the portion within the dotted-line box must directly learn the mapping $$f(\mathbf{x})$$. In a residual block (right), the portion within the dotted-line box needs to learn the residual mapping $$g(\mathbf{x}) = f(\mathbf{x}) - \mathbf{x}$$, making the identity mapping $$f(\mathbf{x}) = \mathbf{x}$$ easier to learn.
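To see why the identity is easy to represent, consider a minimal sketch in which the learned part $$g$$ is a single linear layer with hypothetical weights `W` and bias `b`, written with plain Python lists and no framework. The activation function is left out to keep the comparison simple. Setting `W` and `b` to zero makes the residual block return its input unchanged, whereas a regular block with the same parameters returns zeros.

```python
def linear(W, b, x):
    """Compute W @ x + b for a small dense layer given as nested lists."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def regular_block(W, b, x):
    # Must learn f(x) directly.
    return linear(W, b, x)

def residual_block(W, b, x):
    # Learns g(x) = f(x) - x; the shortcut connection adds x back.
    return [g_i + x_i for g_i, x_i in zip(linear(W, b, x), x)]

x = [1.5, -2.0, 0.25]
W_zero = [[0.0] * 3 for _ in range(3)]
b_zero = [0.0] * 3

print(regular_block(W_zero, b_zero, x))
print(residual_block(W_zero, b_zero, x))
```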

ResNet follows VGG’s full $$3\times 3$$ convolutional layer design. The residual block has two $$3\times 3$$ convolutional layers with the same number of output channels. Each convolutional layer is followed by a batch normalization layer and a ReLU activation function. Then, we skip these two convolution operations and add the input directly before the final ReLU activation function. This kind of design requires that the output of the two convolutional layers has to be of the same shape as the input, so that they can be added together. If we want to change the number of channels, we need to introduce an additional $$1\times 1$$ convolutional layer to transform the input into the desired shape for the addition operation. Let’s have a look at the code below.

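import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

class Residual(nn.Module):  #@save
    """The Residual block of ResNet."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        # Two 3x3 convolutions; padding=1 keeps height and width at stride 1
        self.conv1 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1,
                                   stride=strides)
        self.conv2 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            # Optional 1x1 convolution that reshapes the shortcut input
            self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        # Add the shortcut before the final ReLU
        Y += X
        return F.relu(Y)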


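from mxnet import init, np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l

npx.set_np()

class Residual(nn.Block):  #@save
    """The Residual block of ResNet."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1, **kwargs):
        super().__init__(**kwargs)
        # Two 3x3 convolutions; padding=1 keeps height and width at stride 1
        self.conv1 = nn.Conv2D(num_channels, kernel_size=3, padding=1,
                               strides=strides)
        self.conv2 = nn.Conv2D(num_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            # Optional 1x1 convolution that reshapes the shortcut input
            self.conv3 = nn.Conv2D(num_channels, kernel_size=1,
                                   strides=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm()
        self.bn2 = nn.BatchNorm()

    def forward(self, X):
        Y = npx.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        # Add the shortcut before the final ReLU
        return npx.relu(Y + X)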


import jax
from flax import linen as nn
from jax import numpy as jnp
from d2l import jax as d2l

class Residual(nn.Module):  #@save
    """The Residual block of ResNet."""
    num_channels: int
    use_1x1conv: bool = False
    strides: tuple = (1, 1)
    training: bool = True

    def setup(self):
        # Two 3x3 convolutions; 'same' padding keeps the shape at stride 1
        self.conv1 = nn.Conv(self.num_channels, kernel_size=(3, 3),
                             padding='same', strides=self.strides)
        self.conv2 = nn.Conv(self.num_channels, kernel_size=(3, 3),
                             padding='same')
        if self.use_1x1conv:
            # Optional 1x1 convolution that reshapes the shortcut input
            self.conv3 = nn.Conv(self.num_channels, kernel_size=(1, 1),
                                 strides=self.strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm(not self.training)
        self.bn2 = nn.BatchNorm(not self.training)

    def __call__(self, X):
        Y = nn.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        # Add the shortcut before the final ReLU
        Y += X
        return nn.relu(Y)


import tensorflow as tf
from d2l import tensorflow as d2l

class Residual(tf.keras.Model):  #@save
    """The Residual block of ResNet."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        # Two 3x3 convolutions; 'same' padding keeps the shape at stride 1
        self.conv1 = tf.keras.layers.Conv2D(num_channels, padding='same',
                                            kernel_size=3, strides=strides)
        self.conv2 = tf.keras.layers.Conv2D(num_channels, kernel_size=3,
                                            padding='same')
        self.conv3 = None
        if use_1x1conv:
            # Optional 1x1 convolution that reshapes the shortcut input
            self.conv3 = tf.keras.layers.Conv2D(num_channels, kernel_size=1,
                                                strides=strides)
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.bn2 = tf.keras.layers.BatchNormalization()

    def call(self, X):
        Y = tf.keras.activations.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)
        # Add the shortcut before the final ReLU
        Y += X
        return tf.keras.activations.relu(Y)


This code generates two types of networks: one where we add the input to the output before applying the ReLU nonlinearity whenever `use_1x1conv=False`, and one where we adjust channels and resolution by means of a $$1 \times 1$$ convolution before adding. Fig. 8.6.3 illustrates this.

Fig. 8.6.3 ResNet block with and without $$1 \times 1$$ convolution, which transforms the input into the desired shape for the addition operation.

Now let’s look at a situation where the input and output are of the same shape, where $$1 \times 1$$ convolution is not needed.

blk = Residual(3)
X = torch.randn(4, 3, 6, 6)
blk(X).shape




torch.Size([4, 3, 6, 6])


blk = Residual(3)
blk.initialize()
X = np.random.randn(4, 3, 6, 6)
blk(X).shape


(4, 3, 6, 6)


blk = Residual(3)
X = jax.random.normal(d2l.get_key(), (4, 6, 6, 3))
blk.init_with_output(d2l.get_key(), X)[0].shape



We also have the option to halve the output height and width while increasing the number of output channels. In this case we use $$1 \times 1$$ convolutions via `use_1x1conv=True`. This comes in handy at the beginning of each ResNet block to reduce the spatial dimensionality via `strides=2`.
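The shape bookkeeping behind this option can be checked with the standard convolution output-size formula, floor((n + 2p - k)/s) + 1. The helper name `conv_out` and the 6×6 input size are illustrative, and a padding of 1 for the $$3 \times 3$$ convolutions is an assumption; the kernel sizes and strides follow the block description above.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Output height/width of a convolution on an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1

n = 6  # illustrative input height and width

# Main branch: 3x3 conv (padding 1, stride 2), then 3x3 conv (padding 1, stride 1).
main = conv_out(conv_out(n, kernel=3, stride=2, padding=1),
                kernel=3, stride=1, padding=1)

# Shortcut branch: 1x1 conv with the same stride of 2.
shortcut = conv_out(n, kernel=1, stride=2, padding=0)

# Without the strided 1x1 convolution the shortcut would keep the input size.
identity = n

print(main, shortcut, identity)
```

The main branch and the strided $$1 \times 1$$ shortcut end up with the same height and width, so they can be added; an untouched identity shortcut would not match.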


## 8.6.3. ResNet Model

The first two layers of ResNet are the same as those of the GoogLeNet we described before: the $$7\times 7$$ convolutional layer with 64 output channels and a stride of 2 is followed by the $$3\times 3$$ max-pooling layer with a stride of 2. The difference is the batch normalization layer added after each convolutional layer in ResNet.


GoogLeNet uses four modules made up of Inception blocks. However, ResNet uses four modules made up of residual blocks, each of which uses several residual blocks with the same number of output channels. The number of channels in the first module is the same as the number of input channels. Since a max-pooling layer with a stride of 2 has already been used, it is not necessary to reduce the height and width. In the first residual block for each of the subsequent modules, the number of channels is doubled compared with that of the previous module, and the height and width are halved.
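The rule in this paragraph can be sketched without any framework by listing the settings each residual block in a module would receive. The function `module_block_specs` and its dictionary layout are hypothetical; the `use_1x1conv` and `strides` keys mirror the arguments of the `Residual` class above, and two blocks per module with 64, 128, 256, and 512 channels is the configuration assumed here.

```python
def module_block_specs(num_residuals, num_channels, first_module=False):
    """Return the per-block settings for one ResNet module."""
    specs = []
    for i in range(num_residuals):
        if i == 0 and not first_module:
            # First block of a later module: halve height/width, change channels.
            specs.append({"num_channels": num_channels,
                          "use_1x1conv": True, "strides": 2})
        else:
            specs.append({"num_channels": num_channels,
                          "use_1x1conv": False, "strides": 1})
    return specs

# Four modules with two blocks each; channels double from one module to the next.
arch = [(2, 64), (2, 128), (2, 256), (2, 512)]
modules = [module_block_specs(n, c, first_module=(i == 0))
           for i, (n, c) in enumerate(arch)]

for m in modules:
    print(m)
```

Only the first block of modules two to four uses the strided $$1 \times 1$$ shortcut; every other block keeps the shape of its input.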


Then, we add all the modules to ResNet. Here, two residual blocks are used for each module. Lastly, just like GoogLeNet, we add a global average pooling layer, followed by the fully connected layer output.


There are 4 convolutional layers in each module (excluding the $$1\times 1$$ convolutional layer). Together with the first $$7\times 7$$ convolutional layer and the final fully connected layer, there are 18 layers in total. Therefore, this model is commonly known as ResNet-18. By configuring different numbers of channels and residual blocks in the module, we can create different ResNet models, such as the deeper 152-layer ResNet-152. Although the main architecture of ResNet is similar to that of GoogLeNet, ResNet’s structure is simpler and easier to modify. All these factors have resulted in the rapid and widespread use of ResNet. Fig. 8.6.4 depicts the full ResNet-18.

Fig. 8.6.4 The ResNet-18 architecture.
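The layer count behind the name ResNet-18 can be checked with a few lines of arithmetic. The sketch below is only illustrative: the helper `resnet_depth` is a hypothetical name, and the per-module block counts for ResNet-152 (3, 8, 36, 3 bottleneck blocks with three convolutions each) are taken from the commonly cited configuration rather than from this section.

```python
def resnet_depth(blocks_per_module, convs_per_block=2):
    """Count weighted layers: stem conv + residual convs + final dense layer.

    The 1x1 shortcut convolutions are excluded, as in the text.
    """
    residual_convs = sum(blocks_per_module) * convs_per_block
    return 1 + residual_convs + 1


# Four modules with two residual blocks each, two 3x3 convolutions per block,
# i.e. 4 convolutional layers per module.
print(resnet_depth([2, 2, 2, 2]))        # 18

# Assumed ResNet-152 layout: bottleneck blocks with three convolutions each.
print(resnet_depth([3, 8, 36, 3], 3))    # 152
```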

Before training ResNet, let’s observe how the input shape changes across different modules in ResNet. As in all the previous architectures, the resolution decreases while the number of channels increases up until the point where a global average pooling layer aggregates all features.
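As a rough illustration of that shape progression, the following sketch traces the spatial resolution and channel count by hand, using the standard output-size formula for convolutions and pooling. Several details are assumptions not stated in this section: a 96×96 input, a stride-2 stem with padding 3, a 3×3 stride-2 max-pooling layer, and channel widths of 64, 128, 256, and 512. The function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output height/width of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_resnet18(size=96):
    """Return (stage name, channels, height/width) for each stage."""
    shapes = []
    size = conv_out(size, kernel=7, stride=2, padding=3)
    shapes.append(("stem conv", 64, size))
    size = conv_out(size, kernel=3, stride=2, padding=1)
    shapes.append(("max pool", 64, size))
    for i, channels in enumerate((64, 128, 256, 512)):
        # Assumed layout: the first module keeps the resolution,
        # every later module halves it with a stride-2 convolution.
        stride = 1 if i == 0 else 2
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        shapes.append((f"module {i + 1}", channels, size))
    # Global average pooling collapses each feature map to a single value.
    shapes.append(("global avg pool", 512, 1))
    return shapes


for name, channels, size in trace_resnet18():
    print(f"{name:16s} channels={channels:4d} size={size}x{size}")
```

Under these assumptions the resolution falls from 96 to 48, 24, 24, 12, 6, and 3 while the channel count grows from 64 to 512, before pooling reduces each map to one value.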


## 8.6.4. Training

We train ResNet on the Fashion-MNIST dataset, just like before. ResNet is quite a powerful and flexible architecture. The plot capturing training and validation loss illustrates a significant gap between both graphs, with the training loss being significantly lower. For a network of this flexibility, more training data would offer significant benefit in closing the gap and improving accuracy.


## 8.6.5. ResNeXt

One of the challenges one encounters in the design of ResNet is the trade-off between nonlinearity and dimensionality within a given block. That is, we could add more nonlinearity by increasing the number of layers, or by increasing the width of the convolutions. An alternative strategy is to increase the number of channels that can carry information between blocks. Unfortunately, the latter comes with a quadratic penalty since the computational cost of ingesting $$c_i$$ channels and emitting $$c_o$$ channels is proportional to $$\mathcal{O}(c_i \cdot c_o)$$.

We can take some inspiration from the Inception block, which has information flowing through the block in separate groups. Applying the idea of multiple independent groups to the ResNet block led to the design of ResNeXt. Different from the smorgasbord of transformations in Inception, ResNeXt adopts the same transformation in all branches, thus minimizing the need for manual tuning of each branch.

Fig. 8.6.5 The ResNeXt block. The use of grouped convolution with $$g$$ groups is $$g$$ times faster than a dense convolution. It is a bottleneck residual block when the number of intermediate channels $$b$$ is less than $$c$$.

Breaking up a convolution from $$c_i$$ to $$c_o$$ channels into one of $$g$$ groups of size $$c_i/g$$ generating $$g$$ outputs of size $$c_o/g$$ is called, quite fittingly, a grouped convolution. The computational cost (proportionally) is reduced from $$\mathcal{O}(c_i \cdot c_o)$$ to $$\mathcal{O}(g \cdot (c_i/g) \cdot (c_o/g)) = \mathcal{O}(c_i \cdot c_o / g)$$, i.e., it is $$g$$ times faster. Even better, the number of parameters needed to generate the output is also reduced from a $$c_i \times c_o$$ matrix to $$g$$ smaller matrices of size $$(c_i/g) \times (c_o/g)$$, again a $$g$$ times reduction. In what follows we assume that both $$c_i$$ and $$c_o$$ are divisible by $$g$$.
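The $$g$$-fold saving can be verified by counting weights directly. This is a minimal sketch; `conv_weight_count` is a hypothetical helper, and the channel sizes are arbitrary example values.

```python
def conv_weight_count(c_in, c_out, groups=1, kernel=1):
    """Number of weights in a (grouped) convolution, ignoring biases."""
    if c_in % groups or c_out % groups:
        raise ValueError("c_in and c_out must both be divisible by groups")
    # Each group maps c_in/groups input channels to c_out/groups outputs.
    per_group = (c_in // groups) * (c_out // groups) * kernel * kernel
    return groups * per_group


dense = conv_weight_count(64, 128)              # 64 * 128 = 8192
grouped = conv_weight_count(64, 128, groups=4)  # 4 * 16 * 32 = 2048
print(dense, grouped, dense // grouped)         # 8192 2048 4
```

The same ratio applies to the multiply-accumulate count per output position, which is why the cost and the parameter count shrink together.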

The only challenge in this design is that no information is exchanged between the $$g$$ groups. The ResNeXt block of Fig. 8.6.5 amends this in two ways: the grouped convolution with a $$3 \times 3$$ kernel is sandwiched in between two $$1 \times 1$$ convolutions. The second one serves double duty in changing the number of channels back. The benefit is that we only pay the $$\mathcal{O}(c \cdot b)$$ cost for $$1 \times 1$$ kernels and can make do with an $$\mathcal{O}(b^2 / g)$$ cost for $$3 \times 3$$ kernels. Similar to the residual block implementation discussed earlier, the residual connection is replaced (thus generalized) by a $$1 \times 1$$ convolution.
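To make the cost argument concrete, the sketch below adds up the weights of the three convolutions on the main path of such a block. The function name and the example values ($$c = 256$$, $$b = 128$$, $$g = 32$$) are illustrative assumptions, and the optional shortcut convolution is left out.

```python
def resnext_block_cost(c, b, g, kernel=3):
    """Weights on the main path: 1x1 reduce, grouped 3x3, 1x1 expand."""
    if b % g:
        raise ValueError("b must be divisible by g")
    reduce_cost = c * b                        # 1x1 conv: c -> b channels
    grouped_cost = kernel * kernel * b * b // g  # grouped 3x3 conv: b -> b
    expand_cost = b * c                        # 1x1 conv: b -> c channels
    return reduce_cost + grouped_cost + expand_cost


print(resnext_block_cost(256, 128, 32))  # 70144
print(resnext_block_cost(256, 128, 1))   # 212992 (no grouping)
```

With grouping, the $$3 \times 3$$ term drops from 147456 to 4608 weights, so the two $$1 \times 1$$ convolutions dominate the total.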

The right figure in Fig. 8.6.5 provides a much more concise summary of the resulting network block. It will also play a major role in the design of generic modern CNNs. Note that the idea of grouped convolutions dates back to the implementation of AlexNet. When distributing the network across two GPUs with limited memory, the implementation treated each GPU as its own channel with no ill effects.

The implementation of the ResNeXt block class takes as arguments the number of groups ($$g$$) and the number of intermediate (bottleneck) channels ($$b$$). Lastly, when we need to reduce the height and width of the representation, we add a stride of $$2$$.


Its use is entirely analogous to that of the residual block discussed previously. For instance, with a stride of $$1$$, the input and output are of the same shape. Alternatively, a stride of $$2$$ halves the output height and width.


## 8.6.6. Summary and Discussion

Nested function classes are desirable since they allow us to obtain strictly more powerful rather than also subtly different function classes when adding capacity. One way to accomplish this is by allowing additional layers to simply pass through the input to the output. Residual connections allow for this. As a consequence, this changes the inductive bias from simple functions being of the form $$f(\mathbf{x}) = 0$$ to simple functions looking like $$f(\mathbf{x}) = \mathbf{x}$$.
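The change in inductive bias can be seen in a toy example with all weights set to zero: a plain layer then outputs zero, whereas a layer with a residual connection returns its input. The sketch uses a single linear map without activation or normalization, which is a simplification of a real residual block, and the function names are hypothetical.

```python
def linear(weights, x):
    """Multiply a weight matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def plain_layer(weights, x):
    """f(x) = g(x): zero weights give the zero function."""
    return linear(weights, x)


def residual_layer(weights, x):
    """f(x) = x + g(x): zero weights give the identity function."""
    return [xi + gi for xi, gi in zip(x, linear(weights, x))]


zero_weights = [[0.0] * 3 for _ in range(3)]
x = [1.0, -2.0, 0.5]
print(plain_layer(zero_weights, x))     # [0.0, 0.0, 0.0]
print(residual_layer(zero_weights, x))  # [1.0, -2.0, 0.5]
```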

The residual mapping can learn the identity function more easily, such as by pushing parameters in the weight layer to zero. We can train an effective deep neural network by having residual blocks. Inputs can forward propagate faster through the residual connections across layers. As a consequence, we can train much deeper networks. For instance, the original ResNet paper allowed for up to 152 layers. Another benefit of residual networks is that they allow us to add layers, initialized as the identity function, during the training process. After all, the default behavior of a layer is to let the data pass through unchanged. This can accelerate the training of very large networks in some cases.

Prior to residual connections, bypassing paths with gating units were introduced to effectively train highway networks with over 100 layers. Using identity functions as bypassing paths, ResNet performed remarkably well on multiple computer vision tasks. Residual connections had a major influence on the design of subsequent deep neural networks, of both convolutional and sequential nature. As we will introduce later, the Transformer architecture adopts residual connections (together with other design choices) and is pervasive in areas as diverse as language, vision, speech, and reinforcement learning.

ResNeXt is an example of how the design of convolutional neural networks has evolved over time: by being more frugal with computation and trading it off with the size of the activations (number of channels), it allows for faster and more accurate networks at lower cost. An alternative way of viewing grouped convolutions is to think of a block-diagonal matrix for the convolutional weights. Note that there are quite a few such “tricks” that lead to more efficient networks. For instance, ShiftNet mimics the effects of a $$3 \times 3$$ convolution, simply by adding shifted activations to the channels, offering increased function complexity, this time without any computational cost.
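The block-diagonal view can be illustrated for the simplest case, a $$1 \times 1$$ grouped convolution acting on a single pixel, where the convolution reduces to a matrix-vector product over channels. This is a sketch with hypothetical helper names and made-up weights: applying each group's small matrix to its own slice of channels gives the same result as one dense matrix whose off-diagonal blocks are zero.

```python
def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def grouped_apply(blocks, x):
    """Apply each group's matrix to its own contiguous slice of channels."""
    out, start = [], 0
    for block in blocks:
        width = len(block[0])
        out.extend(matvec(block, x[start:start + width]))
        start += width
    return out


def block_diagonal(blocks):
    """Assemble the equivalent dense matrix with zeros off the diagonal."""
    total_in = sum(len(block[0]) for block in blocks)
    rows, offset = [], 0
    for block in blocks:
        width = len(block[0])
        for row in block:
            rows.append([0] * offset + list(row)
                        + [0] * (total_in - offset - width))
        offset += width
    return rows


blocks = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # g = 2 groups of 2 channels
x = [1, 1, 1, 1]
print(grouped_apply(blocks, x))                # [3, 7, 11, 15]
print(matvec(block_diagonal(blocks), x))       # [3, 7, 11, 15]
```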

A common feature of the designs we have discussed so far is that the network design is fairly manual, primarily relying on the ingenuity of the designer to find the “right” network hyperparameters. While clearly feasible, it is also very costly in terms of human time and there is no guarantee that the outcome is optimal in any sense. Later we will discuss a number of strategies for obtaining high quality networks in a more automated fashion. In particular, we will review the notion of network design spaces that led to the RegNetX/Y models.

## 8.6.7. Exercises

1. What are the major differences between the Inception block and the residual block? How do they compare in terms of computation, accuracy, and the classes of functions they can describe?

2. Refer to Table 1 in the ResNet paper to implement different variants of the network.

3. For deeper networks, ResNet introduces a “bottleneck” architecture to reduce model complexity. Try to implement it.

4. In subsequent versions of ResNet, the authors changed the “convolution, batch normalization, and activation” structure to the “batch normalization, activation, and convolution” structure. Make this improvement yourself. See Figure 1 in He et al. for details.

5. Why can’t we just increase the complexity of functions without bound, even if the function classes are nested?

### What is a lambda function in Python?

A lambda expression in Python is an expression that creates a function. The name comes from the lambda calculus, a formal system introduced by Alonzo Church in the 1930s. Thanks to lambda, we can create functions without a name, also known as anonymous functions.

### What is meant by a lambda or anonymous function?

Lambda expressions and anonymous functions are functions that do not need to be declared with a name and can be used directly as an expression.
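As a small illustration of using a lambda directly as an expression, the sketch below defines a function with `lambda` and passes it to a helper that returns another lambda approximating its derivative. The helper name `derivative`, the step size `h`, and the use of a central difference are choices made for this example, not something prescribed by the text.

```python
# An anonymous function bound to a name only for convenience.
square = lambda x: x ** 2


def derivative(f, h=1e-6):
    """Return a lambda approximating f'(x) by a central difference."""
    return lambda x: (f(x + h) - f(x - h)) / (2 * h)


d_square = derivative(square)
print(square(3.0))    # 9.0
print(d_square(3.0))  # approximately 6.0

# A lambda can also be passed inline without ever being named.
print(sorted([3, -1, 2], key=lambda v: abs(v)))  # [-1, 2, 3]
```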