ResNet (Deep Residual Learning for Image Recognition) Review


gmlee729 2024. 12. 1. 11:38

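Key points

1. Basic concepts

- A (near-)identity operation. The preconditioning assumes that input and output barely differ, so F(x) should be close to the zero mapping. Since the weights W are initialized with zero mean, optimization also becomes much easier.

- Bottleneck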

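2. Concepts / terms

- residual : in regression analysis, observed value - estimated value.

- "the leading results on the challenging ImageNet dataset all exploit 'very deep' models, with a depth of sixteen to thirty" (quoted from the paper)

- COCO : name of a dataset. Mainly used for object detection and segmentation.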

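- saturate : the output of an activation function approaches its maximum or minimum value (so its gradient approaches 0)

* non-saturating nonlinearity : a nonlinearity that does not saturate (e.g. ReLU). For positive inputs the gradient stays constant and the output can grow without bound.

* difference from vanishing gradient : vanishing gradient is a network-wide effect that appears during backpropagation, while saturation happens at an individual neuron in the forward pass.

Saturation can cause vanishing gradients (the local gradient becomes 0), which stalls training.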

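- degrade (a drop in quality) : as depth increases, accuracy gets saturated and then degrades rapidly. (Intuitively this should not happen; a deeper model should be at least as good.)

- underlying mapping : the function the layers actually need to learn.

- identity mapping : the input is passed through unchanged, so the output equals the input. ex) 9 x 1 = 9

- easy to optimize : converges well; finds the optimum quickly and easily.

- shortcut connection = skip connection : a connection that skips one or more layers.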

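- FLOP (FLoating point OPeration) : the "number" of floating point operations. Measures the compute cost of a deep learning model. Lower is better.

* FLOPs : just the plural form

* FLOPS (FLoating point Operations Per Second) : the number of operations that can be processed per second. A hardware performance metric. Higher is better.

- -wise : perform something per ~. ex) channel-wise : the conv operation is performed per channel ## feels a bit like "about"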
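The saturation entry in the list above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative and do not come from the paper. It compares the derivative of a sigmoid (saturating) with the derivative of ReLU (non-saturating) as the input grows.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, a saturating nonlinearity."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


# Print both derivatives for increasingly large inputs.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={sigmoid_grad(z):.6f}  relu'={relu_grad(z):.1f}")
```

The sigmoid derivative peaks at 0.25 (at z = 0) and shrinks toward zero as |z| grows, while the ReLU derivative stays at 1 for any positive input. That is the sense in which ReLU is "non-saturating".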

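3. Thoughts

- Why is descending slowly effective? It seems stable and good, but the strength of a DNN is supposed to be the added expressiveness of a deep network.

- Why the word "residual" was chosen. The definition is observed - estimated; maybe because H(x) - x has a similar form..?

- Is the author a genius...

- Think later about whether the bias can really be omitted. It probably has some effect? Or is it simply assumed to be 0?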

Abstract

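- Deep neural networks are hard to train

- Residual learning makes training deep neural networks easier

* easy to optimize, and gains accuracy from increased depth

- 8x deeper than VGG nets (152 layers), but still lower complexity

- 1st place in ILSVRC 2015 classification (3.57% top-5 error on the ImageNet test set, with an ensemble); models based on it also took 1st place in the COCO competitions

* 28% relative improvement on the COCO object detection dataset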

Introduction

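- Deep CNNs have been crucial for image classification. Deep architectures integrate low/mid/high-level features, and the "levels" of features are enriched by the number of stacked layers.

- The leading ImageNet results all use very deep models (16 to 30 layers).

- Depth matters, but is learning better networks as easy as stacking more layers? One obstacle is vanishing/exploding gradients. This has been largely addressed by normalized initialization and intermediate normalization layers, which let networks with tens of layers start converging under SGD.

- Once deeper networks start converging, the degradation problem appears. It is not caused by overfitting: the more layers, the higher the training error.

- A deeper model built by copying a shallower one and making the added layers identity mappings should not have higher training error than the shallower one. But in experiments, trained deeper models do end up with higher training error.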

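- A residual structure is introduced to address the degradation problem. The residual mapping is easier to optimize than the original, unreferenced mapping.

- In the extreme case where an identity mapping is optimal, pushing the residual to zero is easier than fitting an identity mapping with a stack of nonlinear layers.

- Shortcut connections are those skipping one or more layers. Here the shortcut connections simply perform identity mapping.

* No extra parameters or computational complexity are added. Easy to implement without many modifications.

- Not only ImageNet but also CIFAR-10 shows the same degradation problem, and better performance once residual connections are added. Models with over 1000 layers were explored.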

Deep residual learning

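- Rather than having the layers approximate $\mathcal{H}(x)$ directly, let them approximate the residual $\mathcal{F}(x) := \mathcal{H}(x) - x$. The original function then takes the form $\mathcal{F}(x) + x$.

Take x as the input, apply the weights $W_i$ (i indexes the layers within the block) to obtain $\mathcal{F}$, then add x: $y = \mathcal{F}(x, \{W_i\}) + x$. Biases are omitted from the notation, and so is the ReLU applied after the addition.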

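- An idea motivated by the degradation problem. If the added layers could be learned as identity mappings, a deeper model would be no worse than its shallower counterpart; the degradation problem suggests that solvers have difficulty approximating identity mappings with multiple nonlinear layers.

- This reformulation (residual) preconditions the problem. Finding a small perturbation with reference to an identity mapping is easier than learning a new function. Experiments also show that the learned residual functions have small responses.

- Plain and residual networks are compared with the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).

- However, the dimensions of x and $\mathcal{F}(x)$ must match. When they do not, a linear projection $W_s$ on the shortcut is needed.

- The form of the function $\mathcal{F}(x)$ is flexible. Three or more layers are possible. With only one layer, however, no advantage was observed.

- The notation uses fc layers, but it applies to conv layers as well.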
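The formulation above can be sketched in a few lines. This is a minimal illustration in plain Python with a two-layer fully connected block, not the authors' implementation; the helper names (`matvec`, `residual_block`, `plain_block`) are made up for this example, and biases are omitted as in the paper's notation. It shows that with all weights at zero the residual block passes a non-negative input through unchanged, while the plain block outputs zeros.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


def matvec(w: Matrix, v: Vector) -> Vector:
    """Matrix-vector product W v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """y = relu(F(x) + x) with F(x) = W2 relu(W1 x); biases omitted."""
    f = matvec(w2, relu(matvec(w1, x)))
    return relu([fi + xi for fi, xi in zip(f, x)])


def plain_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """The same two layers without the shortcut: y = relu(W2 relu(W1 x))."""
    return relu(matvec(w2, relu(matvec(w1, x))))


zeros = [[0.0] * 3 for _ in range(3)]
x = [1.0, 2.0, 3.0]  # assumed non-negative, e.g. the output of a previous ReLU

print(residual_block(x, zeros, zeros))  # [1.0, 2.0, 3.0]
print(plain_block(x, zeros, zeros))     # [0.0, 0.0, 0.0]
```

For the residual block, the identity mapping corresponds to F(x) = 0, which sits right next to a zero-mean initialization. The plain block would instead have to learn weights that reproduce x through two nonlinear layers.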

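- model architecture

* Lower complexity and fewer filters than VGG nets. The 34-layer baseline has 3.6 billion FLOPs (multiply-adds), only 18% of VGG-19.

- The architecture figure is too long, so no screenshot here..

- When dimensions increase there are two options (1. identity shortcut with extra zero padding, 2. projection shortcut with 1x1 conv). These are drawn as dotted lines; ordinary identity shortcuts are solid lines.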
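The two options for increasing dimensions can be sketched on a single channel vector. This is a simplified illustration that ignores spatial dimensions and stride (in the paper, shortcuts that cross feature-map sizes also use a stride of 2). The function names and the matrix `w_s` are hypothetical values chosen for the example.

```python
from typing import List


def zero_pad_shortcut(x: List[float], out_channels: int) -> List[float]:
    """Option 1: keep x and append zeros for the extra channels (no parameters)."""
    if out_channels < len(x):
        raise ValueError("out_channels must be >= number of input channels")
    return x + [0.0] * (out_channels - len(x))


def projection_shortcut(x: List[float], w_s: List[List[float]]) -> List[float]:
    """Option 2: W_s x; a 1x1 convolution acts like this at each spatial position."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in w_s]


x = [1.0, 2.0]  # 2 input channels at one pixel
# Hypothetical 4x2 projection matrix (learned in a real network).
w_s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]

print(zero_pad_shortcut(x, 4))      # [1.0, 2.0, 0.0, 0.0]
print(projection_shortcut(x, w_s))  # [1.0, 2.0, 1.5, -1.0]
```

Zero padding adds no parameters, whereas the projection adds `out_channels * in_channels` weights per shortcut. That difference is why the parameter-free option is attractive.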

- implementation

* resize with scale augmentation, 224x224 random crop, horizontal flip, per-pixel mean subtraction, standard color augmentation, BN (after each conv and before activation), weight initialization

* mini-batch size 256, learning rate 0.1 and divided by 10 when the error plateaus, up to 60 × 10^4 iterations, weight decay 0.0001, momentum 0.9

* no dropout

* at test time, standard 10-crop testing; for best results, scores are averaged at multiple scales, with the shorter side in {224, 256, 384, 480, 640}
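The "divide by 10 when the error plateaus" rule from the list above can be sketched as follows. The notes do not say how a plateau is detected, so the `patience` and `min_delta` parameters are assumptions introduced purely for illustration, as are the error values.

```python
from typing import Iterable, List


def plateau_schedule(errors: Iterable[float], lr: float = 0.1,
                     factor: float = 0.1, patience: int = 2,
                     min_delta: float = 1e-3) -> List[float]:
    """Return the learning rate used at each evaluation step.

    The rate is multiplied by `factor` once the error has failed to improve
    by more than `min_delta` for `patience` consecutive evaluations.
    `patience` and `min_delta` are assumed knobs, not values from the paper.
    """
    best = float("inf")
    stale = 0
    history: List[float] = []
    for err in errors:
        history.append(lr)
        if err < best - min_delta:
            best, stale = err, 0       # real improvement: reset the counter
        else:
            stale += 1
            if stale >= patience:      # plateau detected: decay the rate
                lr *= factor
                stale = 0
    return history


# Made-up validation errors with two flat stretches.
errors = [0.60, 0.50, 0.45, 0.45, 0.45, 0.40, 0.40, 0.40]
print(plateau_schedule(errors))
```

With these made-up errors the rate stays at 0.1 for the first five evaluations and drops to roughly 0.01 afterwards.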

Experiment

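- What causes the sections where the curve drops sharply? ## personal question

- For the degradation problem, vanishing gradients are unlikely to be the cause because BN takes care of them. The plain networks also still work to some extent.

- The reason for such optimization difficulties will be studied in the future

- The degradation problem is well addressed by ResNet.

- The 18-layer ResNet converged faster than its plain counterpart. ResNet eases optimization and so provides faster convergence at the early stage.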

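- Shortcut comparison (A: zero-padding shortcuts for increasing dimensions; B: projection shortcuts for increasing dimensions, identity for the others; C: all shortcuts are projections). The differences were small, but accuracy ranked A < B < C. So projection shortcuts are not essential. C introduces extra parameters, so it is not used in the rest of the paper.

- Bottleneck architecture. Introduced out of concern for training time (due to practical considerations). 1x1 - 3x3 - 1x1. The 1x1 layers reduce and then restore the dimensions, so the 3x3 layer works on a smaller dimension. (Time complexity is similar to the non-bottleneck block.) A more efficient model.

- Experiments with 18/34/50/101/152 layers achieved the desired effect of accuracy improving with depth. Better than previous models.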
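The claim that the bottleneck keeps a similar cost can be checked with a quick weight count. The sketch below assumes the channel sizes of the paper's example (a basic block on 64-d features, a bottleneck on 256-d features with a 64-d middle layer) and ignores biases and BN parameters; `conv_weights` is a helper made up for this example.

```python
def conv_weights(k: int, c_in: int, c_out: int) -> int:
    """Number of weights in a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


# Basic block on 64-d features: two 3x3 convolutions.
basic_64 = 2 * conv_weights(3, 64, 64)

# Bottleneck on 256-d features: 1x1 reduce -> 3x3 -> 1x1 restore.
bottleneck_256 = (conv_weights(1, 256, 64)
                  + conv_weights(3, 64, 64)
                  + conv_weights(1, 64, 256))

# Hypothetical comparison: two 3x3 convolutions applied directly at 256-d.
basic_256 = 2 * conv_weights(3, 256, 256)

print(basic_64, bottleneck_256, basic_256)  # 73728 69632 1179648
```

At a fixed spatial resolution the multiply-adds are these counts times the number of output positions, so the ratios carry over to compute cost. The bottleneck handles 256-d features at roughly the cost of a basic block on 64-d features, while two 3x3 layers at 256-d would cost about 17 times more.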
