08 Oct 2022

INDEX

1. Goal

  1. Understand the four equations that make up backpropagation in an MLP.
    (Four fundamental equations behind backpropagation)
    • \[\begin{gather}\delta^L = \nabla_aC \ \odot \ \sigma'(z^L) \tag{BP1} \\ \delta^l = \left(\left(w^{l+1}\right)^T\delta^{l+1}\right) \ \odot \ \sigma'(z^l) \tag{BP2}\\ \frac{\partial C}{\partial b_j^l} = \delta^l_j \tag{BP3} \\ \frac{\partial C}{\partial w_{jk}^l} = a^{l-1}_k\delta^l_j \tag{BP4} \end{gather}\]
    • Be warned, though: you shouldn’t expect to instantaneously assimilate the equations. Such an expectation will lead to disappointment. In fact, the backpropagation equations are so rich that understanding them well requires considerable time and patience as you gradually delve deeper into the equations. The good news is that such patience is repaid many times over. And so the discussion in this section is merely a beginning, helping you on the way to a thorough understanding of the equations.

      • This is straight from the text: don't try to absorb the equations in one sitting.
      • On a first reading the notation alone can be dizzying, but keep staring at it and there comes a moment when it clicks. Probably.. (a small numpy walk-through of the four equations follows this list.)
    • Along the way we’ll return repeatedly to the four fundamental equations, and as you deepen your understanding those equations will come to seem comfortable and, perhaps, even beautiful and natural

      • Sounds a bit masochistic, but that's what the book says.
  2. Implement the backpropagation algorithm in code.
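
A minimal sketch of (BP1)-(BP4) in numpy on a toy 2-3-2 network, assuming the quadratic cost so that $\nabla_aC = a^L - y$; the layer sizes, the cost, and all variable names here are illustrative assumptions, not the book's code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(0)
sizes = [2, 3, 2]  # toy network: 2 inputs, 3 hidden, 2 outputs (an assumption)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]
x, y = rng.standard_normal((2, 1)), rng.standard_normal((2, 1))

# feedforward: store every z^l and a^l
activations, zs = [x], []
for w, b in zip(weights, biases):
    zs.append(w @ activations[-1] + b)
    activations.append(sigmoid(zs[-1]))

# BP1: delta^L = nabla_a C (elementwise *) sigma'(z^L), with nabla_a C = a^L - y for the quadratic cost
deltas = [(activations[-1] - y) * sigmoid_prime(zs[-1])]
# BP2: delta^l = ((w^{l+1})^T delta^{l+1}) (elementwise *) sigma'(z^l), walking backwards
for w, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
    deltas.insert(0, (w.T @ deltas[0]) * sigmoid_prime(z))
# BP3 / BP4: gradients with respect to biases and weights
nabla_b = deltas
nabla_w = [d @ a.T for d, a in zip(deltas, activations[:-1])]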

2. Introduction

3. Notation, preview

4. Intermediate quantity: delta

5. Four fundamental equations


6. Backpropagation algorithm

  1. Input x
    • Set the corresponding activation $a^1$ for the input layer
  2. Feedforward
    • For each $l = 2, 3, \dots , L$ compute and save $z^l = w^la^{l-1} + b^l$ and $a^l = \sigma(z^l)$
      • All the intermediate results have to be saved because they are needed later to compute $\frac{\partial C}{\partial w_{jk}^l}$
  3. Output error
    • $\delta^L$ : Compute the vector $\delta^L = \nabla_aC \ \odot \ \sigma'(z^L)$
  4. Backpropagate the error
    • $\delta^l = \left(\left(w^{l+1}\right)^T\delta^{l+1}\right) \ \odot \ \sigma'(z^l)$
  5. Compute the gradient of the cost function
    • Given by (BP3), (BP4)
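
The two methods below are the book's implementation of these five steps. update_mini_batch calls backprop once per training example, sums the resulting per-example gradients, and then takes a single gradient-descent step scaled by eta/len(mini_batch):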
def update_mini_batch(self, mini_batch, eta):
    """Update the network's weights and biases by applying
    gradient descent using backpropagation to a single mini batch.
    The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
    is the learning rate."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    self.weights = [w-(eta/len(mini_batch))*nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b-(eta/len(mini_batch))*nb
                    for b, nb in zip(self.biases, nabla_b)]
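
Note the scaling: dividing eta by len(mini_batch) means the update uses the average gradient over the mini-batch rather than the sum, so the effective step size does not grow with the batch size.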
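
backprop itself follows steps 1-5 above: a feedforward pass that stores every $z^l$ and $a^l$, (BP1) at the output layer, and then (BP2)-(BP4) applied backwards through the layers, using Python's negative list indices to walk from the last layer towards the first: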
def backprop(self, x, y):
    """Return a tuple ``(nabla_b, nabla_w)`` representing the
    gradient for the cost function C_x.  ``nabla_b`` and
    ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
    to ``self.biases`` and ``self.weights``."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # feedforward
    activation = x
    activations = [x] # list to store all the activations, layer by layer
    zs = [] # list to store all the z vectors, layer by layer
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation)+b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # backward pass
    delta = self.cost_derivative(activations[-1], y) * \
        sigmoid_prime(zs[-1])
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
    # Note that the variable l in the loop below is used a little
    # differently to the notation in Chapter 2 of the book.  Here,
    # l = 1 means the last layer of neurons, l = 2 is the
    # second-last layer, and so on.  It's a renumbering of the
    # scheme in the book, used here to take advantage of the fact
    # that Python can use negative indices in lists.
    for l in range(2, self.num_layers):
        z = zs[-l]
        sp = sigmoid_prime(z)
        delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
    return (nabla_b, nabla_w)
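
backprop relies on three helpers defined elsewhere in the book's code. A minimal sketch of them, assuming the quadratic cost $C = \frac{1}{2}\lVert a^L - y \rVert^2$ (so that $\nabla_aC = a^L - y$):

import numpy as np

def sigmoid(z):
    """The sigmoid activation, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    return sigmoid(z) * (1.0 - sigmoid(z))

def cost_derivative(self, output_activations, y):
    """nabla_a C for the quadratic cost (assumed here): a^L - y."""
    return (output_activations - y)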


first draft: 2022.12.12 01:44