AlexNet (ImageNet Classification with Deep Convolutional Neural Networks) review

gmlee729 2024. 11. 16. 12:08

Key points

3. New things I learned

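- Local Response Normalization : rarely used now that BN exists. It damps the influence of a neuron whose value is much larger than the others in the layer (if the outputs are 1, 100, 1, the next layer ends up reflecting almost only the 100).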

- stationarity of statistics : stationarity. The statistics do not change as time passes. For images, the same feature can appear many times regardless of position...

= time-invariant(continuous), shift-invariant(discrete)

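## Features are preserved even if the pixels shift

- stationarity (not moving) : (mainly for time series) the assumption that the probability distribution of the data stays the same regardless of time

- locality of pixel dependencies : locality. The points that are meaningfully connected to a given point in an image are limited to the points around it.

## A CNN satisfies this because each filter only looks at a part of the image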

- overlapping pooling : overlap. The pooling windows overlap each other as they slide, because the stride is smaller than the window size!

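- receptive field : determined by the filter size and stride. It is the spatial extent of the input neurons that influence a single neuron in the output layer. A neuron with a small receptive field sees only a small part of the image and is sensitive to small, local features. A neuron with a large receptive field sees more of the image, is sensitive to larger, more global features, and captures complex patterns well. Growing the receptive field makes the output feature map smaller and lower in resolution, and it comes with higher computation cost, a risk of overfitting, and a risk of overlooking local features.

## Simply put, how much of the image a single cell represents (how much of the image it sees). ex) a single cell for the shape of a nose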
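To make "determined by the filter size and stride" concrete, here is a rough sketch of the usual receptive-field recurrence. The function name and the layer list are my own illustration, and the (kernel, stride) pairs are read off the architecture notes in this review, so treat the numbers as an estimate under those assumptions.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one unit after the given layers.

    `layers` is a list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # distance in input pixels between neighboring units of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# (kernel, stride) pairs assumed from the architecture described in this review
alexnet = [
    (11, 4),  # conv-1
    (3, 2),   # max-pool (overlapping)
    (5, 1),   # conv-2
    (3, 2),   # max-pool
    (3, 1),   # conv-3
    (3, 1),   # conv-4
    (3, 1),   # conv-5
    (3, 2),   # max-pool
]

print(receptive_field(alexnet[:1]))  # 11: one conv-1 unit sees an 11x11 patch
print(receptive_field(alexnet))      # 195: one unit after the last max-pool
```

So a single cell in the last pooled feature map looks at roughly a 195 x 195 region of the input, which is most of the image.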

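4. Impressions

- AlexNet won ILSVRC 2012. It was the first time a CNN-based deep learning method won.

- It reads more like an application case than theory: "we used this and it worked well" (overfitting, error rate, time, etc.)

- I am still curious about the "why". For example, why the hyperparameters were set that way, why those values work well, and so on. Is this related to the chronic problems of AI?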

Model Description

- Uses the ImageNet LSVRC-2010 data : 1.2 million images, 1000 classes

* error rate : top-1 37.5%, top-5 17%

- Model : 60 million parameters, 650,000 neurons, 5 conv layers (some followed by max-pooling), 3 fc layers, softmax

* Using GPU, dropout

- batch size 128, momentum 0.9, weight decay 0.0005. This small weight decay is not only a regularizer, it also reduces the training error.

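- initialize w from a zero-mean Gaussian distribution with sd 0.01. Biases are 1 in conv layers 2, 4, 5 and in the fc hidden layers, 0 everywhere else.

- lr : the same lr is used for all layers. It starts at 0.01 and is divided by 10 whenever the validation error stops improving. It was reduced 3 times before training ended.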

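- 90 cycles. Does this mean iterations? (The paper says "cycles through the training set", so these are epochs, not iterations.)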

- Trained for 5 to 6 days on two GTX 580 (3GB) GPUs

- On LSVRC-2012, error rate : top-5 15.3% (2nd place 26.2%)
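The training settings above (momentum 0.9, weight decay 0.0005, lr 0.01 divided by 10) fit together in one update rule. Below is a minimal NumPy sketch of that rule as I understand it from the paper; the function names and the placeholder gradient are hypothetical, and a real run would get the gradient from backprop over a batch of 128 images.

```python
import numpy as np

MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005


def sgd_step(w, v, grad, lr):
    """One update: v <- 0.9*v - 0.0005*lr*w - lr*grad, then w <- w + v."""
    v_new = MOMENTUM * v - WEIGHT_DECAY * lr * w - lr * grad
    w_new = w + v_new
    return w_new, v_new


def decayed_lr(initial_lr, num_drops):
    """Learning rate after it has been divided by 10 `num_drops` times."""
    return initial_lr / (10 ** num_drops)


rng = np.random.default_rng(0)
w = rng.normal(loc=0.0, scale=0.01, size=4)  # zero-mean Gaussian init, sd 0.01
v = np.zeros_like(w)                         # momentum buffer starts at 0
grad = np.ones_like(w)                       # placeholder gradient (hypothetical)

w, v = sgd_step(w, v, grad, lr=0.01)
print(w, v)
print([decayed_lr(0.01, k) for k in range(4)])  # lr before and after each of the 3 drops
```

The weight decay term is scaled by the learning rate, so it shrinks together with lr each time lr is divided by 10.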

Introduction

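- Until then datasets were small and did not generalize. Simple datasets (MNIST, CIFAR, etc., on the order of tens of thousands of images) were already solved well.

- The immense complexity of object recognition was the problem, and to cope with it the model has to carry prior knowledge.

- Use a CNN (capacity can be controlled by varying depth and breadth, and it makes strong and mostly correct assumptions about the nature of images). Fewer connections and parameters.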

* stationarity of statistics, stationarity, locality of pixel dependencies

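- Expensive on high-resolution images, but GPUs are powerful enough, and ImageNet contains enough labeled training examples to prevent severe overfitting

- Depth matters. Removing any conv layer (each holds no more than 1% of the total parameters) gave inferior performance

- The network size is limited by the amount of GPU memory. It can get bigger and faster as GPUs improve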

The Architecture

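- Performance improved and training time dropped thanks to new and unusual features

- They appear in order of importance.

1. ReLU : non-saturating nonlinearity. Trains several times faster than tanh.

* saturating : the gradient converges to 0 as the input goes to positive or negative infinity (sigmoid, tanh). non-saturating : the output can keep growing toward infinity (relu)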
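A quick numerical check of the saturating vs non-saturating note, as a small sketch (the helper names are mine): the gradient of tanh collapses toward 0 for large inputs, while the gradient of ReLU stays at 1 for any positive input.

```python
import numpy as np


def relu_grad(x):
    """Derivative of max(0, x): 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)


def tanh_grad(x):
    """Derivative of tanh(x): 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2


x = np.array([0.5, 2.0, 10.0])
print(relu_grad(x))  # stays at 1 however large x gets
print(tanh_grad(x))  # shrinks quickly toward 0 as x grows
```

With a gradient that does not vanish for large activations, the weight updates keep their size, which is one common explanation for the faster training.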

2. GPU : two GPUs used (GTX 580). Spread the net across two GPUs. Cross-GPU parallelization. Reduces error rates.

## To study further later

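3. Local Response Normalization : aids generalization. Uses the values of n adjacent kernel maps at the same position. Lateral inhibition. Applied after the ReLU.

"brightness normalization". Reduces top-1 by 1.4%, top-5 by 1.2%.

I do not really understand this. It is probably fine not to know it well.. other reviews also just skip it. (BN or LN can be used instead..)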
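For reference, the normalization in item 3 can be written in a few lines. This is a slow but readable NumPy sketch of the formula as I read it from the paper, b_i = a_i / (k + alpha * sum_j a_j^2)^beta with the sum running over the n neighboring kernel maps at the same position. The defaults k = 2, n = 5, alpha = 1e-4, beta = 0.75 are the constants the paper reports; the function name and array layout are my own assumptions.

```python
import numpy as np


def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize activations `a` of shape (channels, height, width) across channels."""
    num_channels = a.shape[0]
    squared = a ** 2
    out = np.empty_like(a, dtype=float)
    half = n // 2
    for i in range(num_channels):
        lo = max(0, i - half)                 # first neighboring kernel map
        hi = min(num_channels - 1, i + half)  # last neighboring kernel map
        denom = (k + alpha * squared[lo:hi + 1].sum(axis=0)) ** beta
        out[i] = a[i] / denom
    return out


# the 1, 100, 1 example from the notes, as three kernel maps at one position
a = np.array([1.0, 100.0, 1.0]).reshape(3, 1, 1)
print(local_response_norm(a).ravel())
```

One thing the sketch makes visible: a large activation raises the denominator for its neighbors too, which is the "lateral inhibition" part. With only three channels and n = 5, all three values share the same window, so they shrink by the same factor and the effect on this toy example is mild.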

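4. Overlapping Pooling : summarizes the outputs of neighboring groups of neurons

Reduces top-1 by 0.4%, top-5 by 0.3% (overlapping s = 2, z = 3 compared with non-overlapping s = 2, z = 2).

Makes overfitting slightly harder.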
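A tiny 1-D sketch of the difference between ordinary and overlapping pooling (the helper is hypothetical, written only for this note). With window size z = 3 and stride s = 2 neighboring windows share one element, while z = 2, s = 2 windows do not overlap at all.

```python
import numpy as np


def max_pool_1d(x, size, stride):
    """Max-pool a 1-D array with window `size` and step `stride` (no padding)."""
    starts = range(0, len(x) - size + 1, stride)
    return np.array([x[s:s + size].max() for s in starts])


x = np.array([1, 5, 2, 0, 3, 4, 1])
print(max_pool_1d(x, size=2, stride=2))  # [5 2 4]  non-overlapping windows
print(max_pool_1d(x, size=3, stride=2))  # [5 3 4]  overlapping windows
```

Both versions produce the same number of outputs here, but in the overlapping case each output summarizes a slightly wider neighborhood.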

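5. overall architecture : conv layers 2, 4, 5 connect only to kernel maps on the same GPU, while conv layer 3 connects to the kernel maps on both GPUs. Response normalization follows conv 1 and 2. Max-pooling follows both response-normalization layers and conv layer 5. ReLU is applied to every conv and fc layer

Input size is 224 x 224 x 3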

	raw	conv-1	conv-2	conv-3	conv-4	conv-5
filter	-	11x11x3x96	5x5x96x256	3x3x256x384	3x3x384x384	3x3x384x256
output	224x224x3	55x55x96	27x27x256	13x13x384	13x13x384	13x13x256
applied	-	stride 4, max-pool, LRN	max-pool, LRN	no GPU split	-	max-pool

fc1 : 4096 - fc2 4096 - output 1000

- 224 x 224 x 3
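The sizes in the table can be checked with the standard output-size formula, (input + 2 * padding - kernel) / stride + 1. The sketch below uses a helper of my own. It also shows a well-known wrinkle: an 11 x 11 filter with stride 4 and no padding does not tile a 224 input evenly, while 227 gives exactly 55. Many reimplementations use 227 (or add padding) for that reason; the paper itself states 224, so treat 227 as an assumption.

```python
def conv_output_size(input_size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer, or None if the window does not fit evenly."""
    span = input_size + 2 * padding - kernel
    if span < 0 or span % stride != 0:
        return None
    return span // stride + 1


print(conv_output_size(224, kernel=11, stride=4))  # None: 213 is not divisible by 4
print(conv_output_size(227, kernel=11, stride=4))  # 55
print(conv_output_size(55, kernel=3, stride=2))    # 27 (max-pool after conv-1)
print(conv_output_size(27, kernel=3, stride=2))    # 13 (max-pool after conv-2)
print(conv_output_size(13, kernel=3, stride=2))    # 6  (max-pool after conv-5)
```

The 55, 27, 13 chain matches the output row of the table, with the pooling layers (z = 3, s = 2) doing the downsampling between conv layers.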

Preventing overfitting

: 60 million parameters.

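1. Data augmentation : needs little computation and nothing has to be stored on disk. Generated on the CPU (while the GPU is training on the previous batch of images, so it is computationally free). Image translation, horizontal reflection.

Random 224x224 patches are extracted from the 256x256 images (together with their horizontal reflections). The training set grows by a factor of 2048 (the resulting examples are highly inter-dependent).

* a factor of : N times (5 times, 10 times)

At test time, five 224x224 patches (4 corners, 1 center) and their horizontal reflections give 10 patches, and the predictions on them are averaged to make a prediction.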
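The cropping scheme for training and test can be sketched as follows. The function names and the random stand-in image are hypothetical. I use 32 offsets per axis so that the count matches 32 x 32 x 2 = 2048; that reading of the 2048 figure is my assumption.

```python
import numpy as np

CROP = 224


def random_crop_flip(image, rng):
    """Training: a random 224x224 patch from an (H, W, 3) image, reflected half the time."""
    h, w = image.shape[:2]
    top = rng.integers(0, max(h - CROP, 1))   # 32 possible offsets for a 256-pixel side
    left = rng.integers(0, max(w - CROP, 1))
    patch = image[top:top + CROP, left:left + CROP]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]                # horizontal reflection
    return patch


def ten_crop(image):
    """Test: four corner patches and the center patch, plus their reflections."""
    h, w = image.shape[:2]
    corners = [(0, 0), (0, w - CROP), (h - CROP, 0), (h - CROP, w - CROP)]
    center = ((h - CROP) // 2, (w - CROP) // 2)
    patches = [image[t:t + CROP, l:l + CROP] for t, l in corners + [center]]
    patches += [p[:, ::-1] for p in patches]  # add the 5 reflections
    return np.stack(patches)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)  # stand-in image
print(random_crop_flip(image, rng).shape)  # (224, 224, 3)
print(ten_crop(image).shape)               # (10, 224, 224, 3)
```

At test time the network would be run on all 10 patches and the 10 softmax outputs averaged; the network itself is left out of this sketch.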

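Alter the intensities of the RGB channels. Multiples of the principal components extracted with PCA are added to the image, scaled by a random variable and ~~~ I do not understand what this means

It captures an important property of images: object identity stays the same when the intensity and color of the illumination change

Reduces top-1 by over 1%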
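Since the RGB part was the confusing one, here is how I would sketch it after rereading the paper, as far as I understand it. PCA is run on the RGB pixel values, and each image gets a single RGB offset: the sum of the principal components p_i, each scaled by its eigenvalue lambda_i and a random alpha_i drawn from a Gaussian with mean 0 and sd 0.1. The function names are mine. The paper fits the PCA on the whole training set; this toy example fits it on one stand-in image only to stay self-contained.

```python
import numpy as np


def fit_rgb_pca(pixels):
    """pixels: (N, 3) RGB values. Returns eigenvalues (3,) and eigenvectors (3, 3, one per column)."""
    cov = np.cov(pixels, rowvar=False)  # 3x3 covariance of the RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvals, eigvecs


def pca_color_shift(eigvals, eigvecs, rng, std=0.1):
    """One RGB offset per image: sum_i p_i * (alpha_i * lambda_i), alpha_i ~ N(0, std)."""
    alpha = rng.normal(0.0, std, size=3)
    return eigvecs @ (alpha * eigvals)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))  # stand-in image with values in [0, 1]

# assumption for the demo: PCA fitted on this single image instead of the training set
eigvals, eigvecs = fit_rgb_pca(image.reshape(-1, 3))

# the same 3-vector is added to every pixel of the image
augmented = image + pca_color_shift(eigvals, eigvecs, rng)
print(augmented.shape)
```

So nothing is multiplied into the image itself: the whole image is shifted by one small color offset, mostly along the directions in which RGB values naturally vary.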

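2. Dropout : combining many different models is very good at lowering error but expensive. Dropout sets the output of each hidden neuron to 0 with p = 0.5. Those neurons take part in neither the forward nor the backward pass. The weights are shared. Reduces co-adaptation.

At test time all neurons are used, but their outputs are multiplied by 0.5.

Applied to the first two fc layers. It prevented overfitting. Convergence takes roughly twice as many iterations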
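The train/test behavior of dropout described above, as a minimal NumPy sketch (function names are mine). This follows the scheme in these notes, where outputs are scaled by 0.5 at test time. Most current frameworks instead scale at training time ("inverted dropout"), which gives the same expected values.

```python
import numpy as np


def dropout_train(x, rng, p_drop=0.5):
    """Training: zero each unit with probability p_drop; dropped units pass no signal."""
    mask = rng.random(x.shape) >= p_drop
    return x * mask, mask


def dropout_test(x, p_drop=0.5):
    """Test: keep every unit and scale the outputs by the keep probability."""
    return x * (1.0 - p_drop)


rng = np.random.default_rng(0)
x = np.ones(10000)
train_out, mask = dropout_train(x, rng)
print(train_out.mean())        # close to 0.5: about half of the units survive
print(dropout_test(x).mean())  # 0.5: matches the expected training-time output
```

The mask would also be applied to the gradient in the backward pass, so a dropped unit gets no update for that input.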

Discussion

Depth is very important. Removing even a single middle layer costs about 2% of top-1 performance.
