08 Oct 2022

INDEX

1. Goal
2. Introduction
3. Notation, preview
4. Intermediate quantity: delta
5. Four fundemental equations
6. Backpropagation algorithm

1. Goal

mlp의 역전파를 이루는 4개 수식을 이해해보자.
(Four fundemental equations behind backpropagation)
- \[\begin{gather}\delta^L = \nabla_aC \ \odot \ \sigma'(z^L) \tag{BP1} \\ \delta^l = \left(\left(w^{l+1}\right)^T\delta^{l+1}\right) \ \odot \ \sigma'(z^l) \tag{BP2}\\ \frac{\partial C}{\partial b_j^l} = \delta^l_j \tag{BP3} \\ \frac{\partial C}{\partial w_{jk}^l} = a^{l-1}_k\delta^l_j \tag{BP4} \end{gather}\]
- Be warned, though: you shouldn’t expect to instantaneously assimilate the equations. Such an expectation will lead to disappointment. In fact, the backpropagation equations are so rich that understanding them well requires considerable time and patience as you gradually delve deeper into the equations. The good news is that such patience is repaid many times over. And so the discussion in this section is merely a beginning, helping you on the way to a thorough understanding of the equations.
  - 본문의 내용이다. 한 번에 이해하려 하지 말라고 한다.
  - 처음 보는 입장에선 notation 만 봐도 어지러울 것이다. 그래도 계속 보다보면 이해되는 순간이 온다. 아마도..
- Along the way we’ll return repeatedly to the four fundamental equations, and as you deepen your understanding those equations will come to seem comfortable and, perhaps, even beautiful and natural
  - 좀 변태 같지만 그렇다고 한다
코드로 역전파 알고리즘을 구현해보자

2. Introduction

신경망의 학습이 이뤄지기 위해선 $\partial C/\partial w$를 구해야 한다. ($C$는 cost function이다)
어떻게 구할 수 있을까?
무식하게 모든 $\mathbf w$에 대해 $\lim_{h \rightarrow 0}\frac{L_{x}(\mathbf w+h) - L_x(\mathbf w-h)}{2h}$ 를 구할 수는 없다.
오차역전파 방식을 사용하면 수치미분을 쓰지 않고, 말 그대로 오차($\delta$)를 앞으로 보내면서 각 가중치들의 gradient 를 구할 수 있다.
어떻게 그게 가능한지 알아보자.

3. Notation, preview

이 글은 neuralnetworksanddeeplearning 이라는 사이트를 참고해서 쓰인 글이다. 표기 또한 원문을 그대로 따라갈 것이다.
이 챕터에선 feedforward 연산을 수식으로 정리하면서 앞으로 사용할 notation에 대해 소개한다.
다음 수식을 먼저 보고, notation에 대한 설명을 보는 것이 이해하기 쉽다.
- \[a_j^l = \sigma\left(b_j^l + \sum_k{w_{jk}^l}a_k^{l-1}\right) \tag5\]
  - $w_{jk}^l$ : $(l-1)$^th 층의 $k$^th 번째 뉴런에서 $l$^th 층의 $j$^th 뉴런으로 가는 가중치. mlp에서 노드들은 바로 이전 노드와 연결된다는 것이 자명하므로 $(l-1)$^th 층에서 온다는 것을 굳이 표기하진 않음.
  - $b_j^l$ : $l$^th 층의 $j$^th 뉴런의 bias
  - $a_j^l$ : $l$^th 층의 $j$^th 뉴런의 활성화 값(출력)
- $l$^th 층의 $j$^th 뉴런의 출력 $a_j^l$는 $(l-1)$^th 층의 activation(출력) 들과 연결된다.
vector와 matrix를 사용해서 표기를 간결하게 해보자.
- \[a^l = \sigma\left(w^la^{l-1}+b^l\right) \tag6\]
  - $w^l$ : $l$^th 층과 연결되는 모든 weight들 (matrix)
    $w^l$ 의 $j$ 행의 $k$ 열의 값이 $w_{jk}^l$ 임.
  - $b^l$ : $l$^th 층의 모든 bias들 (vector)
  - $a^l$ : $l$^th 층의 모든 활성화 값(출력)
  - We use the obvious notation $\sigma(v)$ to denote this kind of elementwise application of a function.
    - $\sigma(v)_j = \sigma(v_j)$
\[z_j^l = b_j^l + \sum_k{w_{jk}^l}a_k^{l-1} \tag7\]
$\odot$ 은 elementwise product 를 의미한다.

4. Intermediate quantity: delta

구해야 하는 것은 $\frac{\partial C}{\partial w_{jk}^l}$ 과 $\frac{\partial C}{\partial b_j^l}$ 이다.
- 이 2개를 구하기 위해서 intermediate quantity $\delta^l_j$ 를 도입한다.
- 이는 이는 $l$^th 층의 $j$^th 뉴런의 error 를 의미한다.
  - \[\delta^l_j = \frac{\partial C}{\partial z^l_j} \tag8\]
  - $z^l_j$ 는 특정 뉴런의 activation 을 거치기 전 결과값이다. - 식 (7) 참고
    이때 $z^l_j$ 에 약간의 변화($\Delta z^l_j$)를 준다고 생각해보자. ($\Delta z^l_j$ 는 0에 가까운 아주 작은 상수이다)
    그럼 $C$ 는 $\frac{\partial C}{\partial z^l_j} \Delta z^l_j$ 만큼 변할 것이다
  - 이때 $\frac{\partial C}{\partial z^l_j}$ 가 크다면, $C$ 의 변화량 또한 클 것이다.
    - 만약 $C$ 를 낮춰야 한다면, $\frac{\partial C}{\partial z^l_j}$ 의 반대 부호의 방향으로 가지는 $\Delta z^l_j$ 를 선택하면 되고, $C$ 를 꽤 많이 줄일 수 있을 것이다.
  - 반대로 $\frac{\partial C}{\partial z^l_j}$ 가 작다면 C의 변화량 또한 작아질 것이다.
    - $\Delta z^l_j$ 만큼 $z^l_j$ 가 변해도 $C$ 는 크게 변하지 않는다는 것이다.
    - 이를 반대로 생각하면, $z^l_j$ 가 이미 optimal 에 도달했다고도 해석할 수 있다
  - $\frac{\partial C}{\partial z^l_j}$ 가 크면 $z^l_j$ 가 개선될 여지가 존재한다는 것이고, 오차가 존재한다는 의미로 받아들일 수 있다. 반대로 $\frac{\partial C}{\partial z^l_j}$ 가 작으면 이미 $z^l_j$ 가 optimal 에 근접한 상태라고 생각할 수 있고, 그래서 오차가 적은 상태라고 생각할 수 있다.
  - 이런 관점에서 $\delta^l_j = \frac{\partial C}{\partial z^l_j}$ 를 특정 노드의 오차라고 생각할 수 있다
- 왜 $\frac{\partial C}{\partial a^l_j}$ 를 안 쓰고 $\frac{\partial C}{\partial z^l_j}$ 를 오차($\delta$)로 사용하는가? - 단순히 대수적으로 풀어쓰기 어렵기 때문임 (to make the presentation of backpropagation a little more algebraically complicated)
역전파는 이 error $\delta^l_j$를 구하고, 그리고 $\delta^l_j$를 통해서 $\frac{\partial C}{\partial w_{jk}^l}$ 과 $\frac{\partial C}{\partial b_j^l}$ 를 구하는 방식으로 작동한다.

5. Four fundemental equations

chain rule from multivariable calculus 를 알면 쉽게 이해할 수 있다
\[\delta^L_j = \frac{\partial C}{\partial a_j^L} \sigma'(z^L_j) \tag{BP1}\]
- \[\delta^l_j = \frac{\partial C}{\partial z^l_j} \tag8\]
- 위 8번 식을 변형하면
- \[\delta^l_j = \sum_k{\frac{\partial C}{\partial{a_k^L}} \frac{\partial{a_k^L}}{\partial z^l_j}} \tag9\]
- 이고, $j \neq k$ 이면 $\frac{\partial{a_k^L}}{\partial z^l_j}$ 가 0으로 사라진다. 따라서
- \[\delta^L_j = \frac{\partial C}{\partial a_j^L} \frac{\partial a_j^L}{\partial z_j^L}= \frac{\partial C}{\partial a_j^L} \sigma'(z^L_j) \tag{BP1}\]
- 이다.
(BP2) 의 컨셉은 $\delta^l$ 을 $\delta^{l+1}$ 에 대한 식으로 전개하겠다는 것이다
- 오차를 이전 layer 로 전파시킨다는 컨셉으로 이해하면 된다
- \[\begin{align}\delta^l &= \frac{\partial C}{\partial z^l_j} \tag{8} \\ &= \sum_k {\frac{\partial C}{\partial z_k^{l+1}} \frac{\partial z_k^{l+1}}{\partial z^l_j}} \notag \\ &= \sum_k {\frac{\partial z^{l+1}_k}{\partial z_k^l}} \delta^{l+1}_k \tag{10} \end{align}\]
- \[z_k^{l+1} = b_k^{l+1} + \sum_j{w^{l+1}_{kj} a^l_j} = b_k^{l+1} + \sum_j{w^{l+1}_{kj} \sigma(z_j^l)} \tag{11}\]
- \[{\frac{\partial z^{l+1}_k}{\partial z_k^l}} = w^{l+1}_{kj} \sigma'(z_j^l) \tag{12}\]
- \[\therefore \delta^l_j = \sum_k {w_{kj}^{l+1} \delta_k^{l+1} \sigma'(z^l_j)} \tag{BP2a}\]
  - component form 으로 썼을 뿐 (BP2) 식과 같은 의미이다
식 (BP1) 과 (BP2) 를 이용하면 모든 layer 에서 $\delta$ 를 계산할 수 있다. 귀납법처럼
이제 $\frac{\partial C}{\partial w_{jk}^l}$ 과 $\frac{\partial C}{\partial b_j^l}$ 를 구할 차례이다
- $\frac{\partial C}{\partial w_{jk}^l}$ 과 $\frac{\partial C}{\partial b_j^l}$ 는 그 가중치와 연결된 노드의 $\delta$ 를 통해 구할 수 있는데,
- \[\frac{\partial C}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l}\frac{\partial z_j^l}{\partial w_{jk}^l} = \delta_j^l \frac{\partial z_j^l}{\partial w_{jk}^l} = \delta_j^l a^{l-1}_k \tag{BP4}\]
  - \[\begin{gather} \frac{\partial z_j^l}{\partial w_{jk}^l} = a^{l-1}_k \tag{14} \\ \because z_j^l = b_j^l + \sum_k{w_{jk}^l}a^{l-1}_k \notag\end{gather}\]
- \[\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z_j^l}\frac{\partial z_j^l}{\partial b^l_j} = \delta_j^l \frac{\partial z_j^l}{\partial b^l_j} = \delta_j^l \tag{BP3}\]

기본 4개 식에 대한 내용은 끝났다.
여기까지 왔다면 끝에서부터 $\delta^l$ 을 구해나가는 것이 더이상 peculiar 하게 느껴지지 않을 것이다
- If you think about the proof of backpropagation, the backward movement is a consequence of the fact that the cost is a function of outputs from the network
- 책에선 약간 당연하다는 듯이 표현하는데, 사실 꼭 backward 의 방향으로 계산해야 gradient 를 구할 수 있는 건 아니다
  - forward 방향으로도, 국소적 미분들로 graidnet 계산이 가능하긴 하다
    - 계산 그래프가 $x \rightarrow y \rightarrow z \rightarrow C$ 인 상황이라 생각해보자
    - 이때 gradient 를 forward 방향으로 국소적인 미분들을 곱해가며 구하게 된다면
      - \[\begin{align} \frac{\partial C}{\partial x} &= \frac{\partial y}{\partial x}\frac{\partial z}{\partial y}\frac{\partial C}{\partial z} \notag \\ \frac{\partial C}{\partial y} &= \frac{\partial z}{\partial y} \frac{\partial C}{\partial z} \notag \\ \frac{\partial C}{\partial z} &= \frac{\partial C}{\partial z} \notag \end{align}\]
      - 이렇게 구해지고 $\frac{\partial C}{\partial x}, \frac{\partial C}{\partial y}, \frac{\partial C}{\partial z}$ 마다 각각 한 번씩 탐색해줘야 한다. 한 번에 모든 변수의 gradient 를 계산할 수 없다
    - 반면 뒤에서부터 계산하면
      - \[\begin{align} \frac{\partial C}{\partial z} &= \frac{\partial C}{\partial z}\frac{\partial z}{\partial z} \notag \\ \frac{\partial C}{\partial y} &= \frac{\partial C}{\partial z} \frac{\partial z}{\partial y} \notag \\ \frac{\partial C}{\partial x} &= \frac{\partial C}{\partial y} \frac{\partial y}{\partial x} \notag \end{align}\]
      - 의 형태와 같이, 국소적 미분을 이전에 구해진 값에 계속 곱하기만 하면 된다. DP 에서 memoization 을 하는 것과 같다
        
        불필요하게 구해야 하는 값이 없고, 한 번의 탐색으로 모든 변수들의 gradient 를 계산할 수 있다
        
        이전에 계산된 값이 다음 계산에서 그대로 쓰임
  - 따라서 gradient 를 backward 방향으로 전파하는 것이 더 효율적이고 자연스러운 방식이다
    - 간단히 생각해보면, forward propagation 에서 생긴 오차이기에 반대 방향으로 전파시키는 것(추적하는 것)이 더 자연스러워 보인다

6. Backpropagation algorithm

Input x
- Set the corresponding activation $a^1$ for the input layer
Feedforward
- For each $l = 2, 3, \dots , L$ compute and save $z^l = w^la^{l-1} + b^l$ and $a^l = \sigma(z^l)$
  - $\frac{\partial C}{\partial w_{jk}^l}$ 를 구할 때에 필요하기 때문에 중간 결과들을 전부 저장해야 함
Output error
- $\delta^L$ : Compute the vector $\delta^L = \nabla_aC \ \odot \ \sigma’(z^L)$
Backpropagate the error
- $\delta^l = ((w^{l+1})^\intercal \delta^{l+1}) \ \odot \ \sigma’(z^l) $
Compute the gradient of the cost function
- Given by (BP3), (BP4)

실제 코드 구현을 보면서 이해해보자

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
def update_mini_batch(self, mini_batch, eta):
    """Update the network's weights and biases by applying
    gradient descent using backpropagation to a single mini batch.
    The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
    is the learning rate."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    self.weights = [w-(eta/len(mini_batch))*nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b-(eta/len(mini_batch))*nb
                    for b, nb in zip(self.biases, nabla_b)]

미니배치 안에 있는 x, y 데이터를 하나씩 넣어 backprop 을 진행하고 gradient 를 구한다. 전부 더하고 평균낸 뒤 update 한다.
- 누적해서 더해야 하므로 6, 7번 줄에서 nabla_b 와 nabla_w 를 선언한다
- 미니배치에서 x, y 를 하나씩 들고와서 gradient 를 구하는데, 그때 사용되는 backprop 함수는 다음과 같다.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
def backprop(self, x, y):
    """Return a tuple ``(nabla_b, nabla_w)`` representing the
    gradient for the cost function C_x.  ``nabla_b`` and
    ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
    to ``self.biases`` and ``self.weights``."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # feedforward
    activation = x
    activations = [x] # list to store all the activations, layer by layer
    zs = [] # list to store all the z vectors, layer by layer
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation)+b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # backward pass
    delta = self.cost_derivative(activations[-1], y) * \
        sigmoid_prime(zs[-1])
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
    # Note that the variable l in the loop below is used a little
    # differently to the notation in Chapter 2 of the book.  Here,
    # l = 1 means the last layer of neurons, l = 2 is the
    # second-last layer, and so on.  It's a renumbering of the
    # scheme in the book, used here to take advantage of the fact
    # that Python can use negative indices in lists.
    for l in xrange(2, self.num_layers):
        z = zs[-l]
        sp = sigmoid_prime(z)
        delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
    return (nabla_b, nabla_w)

이 함수에서도 마찬가지로 nabla_b, nabla_b 를 선언한다.
- update_mini_batch 함수에서의 것과는 다르다.
- backprop 의 것에 집중하자
  - 9번 줄을 보면, $a^1$ 에 x 를 대입한다 - 1번
  - 10번에서 중간 결과값, 즉 각 레이어의 $a$ 와 $z$ 들을 저장할 list를 생성한다 - 2번 준비
  - 12번 줄의 loop 에서 feedforward propagation 을 수행하고 중간 값들을 전부 저장한다 - 2번
  - 18, 19번 줄은 (BP1) 을 의미한다. - 3번
    - cost_derivative 안에 activations[-1] 과 y 가 들어가는 이유는, quadratic cost 를 사용하기 때문이다
      - $\frac{\partial C}{\partial a_j^L} = (a^L_j - y_j)$
    - * 연산자는 elementwise product 를 수행한다
  - $\delta^L$ 로 nabla_b[-1], nabla_w[-1] 를 계산한다 - 5번
  - 28번 줄의 loop 에서 앞으로 가면서 4, 5번을 반복한다 - 역전파
전체적으로 보면 마지막 layer 에서부터, 해당 layer 의 $\delta$ 를 구하고 nabla_b, nabla_w 구하고 저장하고, $\delta$ 넘겨서 다음 layer ($l-1$) 의 $\delta$ 구하고.. 이걸 반복한다.
- Roughly speaking, the computational cost of the backward pass is about the same as the forward pass. And so the total cost of backpropagation is roughly the same as making just two forward passes through the network.
  - 본문의 내용이다. forward 한 번, backward 한 번 하니까 2배. 그럴듯해 보인다
다시 update_mini_batch 로 돌아와 12 ~ 15 번 줄을 보면, 이렇게 backprop 에서 구해진 gradient 들을 평균 내서 eta 를 곱하고 경사하강법으로 파라미터들을 update 한다.
- 아주 직관적인 방식의 구현이다. 앞서 나온 수식들을 그대로 재현해서 이해하기 쉽다
- mlp 의 경우엔 계산이 쉬우므로 이렇게도 구현이 가능한데, convolution 처럼 고차원의 복잡한 연산의 역전파는 어떻게 구현해야 할지도 궁금해진다

원문에서 생략한 부분이 꽤 된다. 수식 부분은 가능하면 원문을 보는 게 도움이 많이 된다. 나머지는.. 사실 나도 수식 부분 빼고는 대충 읽어서 잘 모름
- 그래도 읽을 만한 책이다. 부캠 끝나고 시간 나면 끝까지 읽어봐야지…

first draft: 2022.12.12 01:44

INDEX

1. Goal

2. Introduction

3. Notation, preview

4. Intermediate quantity: delta

5. Four fundemental equations

기본 4개 식에 대한 내용은 끝났다.

6. Backpropagation algorithm