Paper Review

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Review)

gmlee729 2024. 12. 4. 21:11

Problem

Internal covariate shift (ICS)

The distribution of each layer's inputs changes during training.

This is a problem because the layers need to continuously adapt to the new distribution.

Saturating nonlinearities, vanishing gradients

Previously addressed with ReLU, careful initialization, and small learning rates; BN is a more stable solution.

Whitening the layer inputs would remove ICS. However, it requires the normalization to be updated at every training step, and if done outside the optimizer it reduces the effect of gradient descent.

If the normalization is computed outside of gradient descent, the parameter update has no effect on the loss: the loss L stays constant while the bias b keeps growing (see the sketch below).
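A tiny NumPy illustration of the paper's example (a layer x = u + b, normalized by subtracting the mean outside of gradient descent); the variable names and numbers here are my own:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=64)                 # fixed input to the layer
target = rng.normal(loc=1.0, size=64)   # arbitrary regression target
b = 0.0                                 # learnable bias, x = u + b
lr = 0.1

for step in range(5):
    x = u + b
    x_hat = x - x.mean()                # normalization applied OUTSIDE of gradient descent
    loss = 0.5 * np.mean((x_hat - target) ** 2)
    # gradient w.r.t. b, ignoring that mean(x) also depends on b:
    grad_b = np.mean(x_hat - target)
    b -= lr * grad_b
    print(f"step {step}: loss={loss:.4f}  b={b:.4f}")
# The printed loss never changes, while b keeps drifting away from 0.
```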

Including the normalization in the gradient descent computation solves this, but the required Jacobian (full whitening of every layer's inputs) makes it far too expensive, so simplifications are needed.

 

Batch Normalization

Definition: a normalization step that fixes the means and variances of layer inputs.

First simplification: normalize each scalar feature independently (per-dimension normalization over the d dimensions).

Simply normalizing can limit what the nonlinearity represents, e.g. the sigmoid is approximately linear near 0.

Add learnable parameters γ and β (scale and shift the normalized value) to restore the representation power and keep the nonlinearity usable.
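Written out (my restatement of the paper's per-dimension formula), the transform for the k-th feature is:

```latex
% Per-dimension normalization followed by the learned scale and shift
% (epsilon is the small constant added for numerical stability, as in Algorithm 1).
\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}] + \epsilon}},
\qquad
y^{(k)} = \gamma^{(k)}\,\hat{x}^{(k)} + \beta^{(k)}
```

Setting γ⁽ᵏ⁾ = √Var[x⁽ᵏ⁾] and β⁽ᵏ⁾ = E[x⁽ᵏ⁾] recovers the original activations, so the transform can represent the identity.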

Second simplification: use the mean and variance of each mini-batch rather than of the whole training set.

At inference time, use the average of the mini-batch means and the (unbiased) average of the mini-batch variances (Algorithm 2); a sketch follows below.
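A minimal NumPy sketch of the two algorithms as I read them; the function names and shapes are my own choices, not from the paper:

```python
import numpy as np

def bn_train(x, gamma, beta, eps=1e-5):
    """Algorithm 1: normalize with mini-batch statistics, then scale and shift.
    x has shape (batch_size, d); gamma and beta have shape (d,)."""
    mu = x.mean(axis=0)                  # per-dimension mini-batch mean
    var = x.var(axis=0)                  # per-dimension mini-batch variance (biased)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, mu, var

def bn_inference(x, gamma, beta, batch_means, batch_vars, m, eps=1e-5):
    """Algorithm 2: use statistics aggregated over many training mini-batches.
    E[x] = average of mini-batch means; Var[x] = (m / (m - 1)) * average of
    mini-batch variances (unbiased correction), where m is the mini-batch size."""
    mean = np.mean(batch_means, axis=0)
    var = (m / (m - 1)) * np.mean(batch_vars, axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Usage: collect (mu, var) from each training mini-batch, then freeze them for inference.
rng = np.random.default_rng(0)
gamma, beta = np.ones(4), np.zeros(4)
means, vars_ = [], []
for _ in range(10):                      # pretend training loop
    batch = rng.normal(loc=3.0, scale=2.0, size=(32, 4))
    _, mu, var = bn_train(batch, gamma, beta)
    means.append(mu); vars_.append(var)
test_x = rng.normal(loc=3.0, scale=2.0, size=(5, 4))
print(bn_inference(test_x, gamma, beta, means, vars_, m=32))
```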

 

 

Advantages

Dramatically accelerates training of deep networks.

Networks converge faster when their inputs are whitened or normalized.

ImageNet experiments: increase the learning rate, remove Dropout, shuffle training examples more thoroughly, reduce the L2 weight regularization, among other changes.

Regularizes the model and reduces the need for Dropout.

BN provides regularization similar to Dropout, because each training example is normalized together with a randomly selected mini-batch.

Reduces the dependence of gradients on the scale of the parameters and of their initial values.

This allows us to use much higher learning rates without the risk of divergence

Prevents small parameter changes from being amplified through the layers; backpropagation through a BN layer is unaffected by the scale of its parameters (checked numerically below).
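A quick numerical check of this scale-invariance property, BN(Wu) = BN((aW)u); the code below is my own illustration, not from the paper:

```python
import numpy as np

def batchnorm(x, eps=1e-5):
    """Plain per-dimension batch normalization (no gamma/beta) over a mini-batch."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(128, 16))   # a mini-batch of layer inputs
W = rng.normal(size=(16, 8))     # layer weights
a = 7.3                          # arbitrary positive scale factor

out1 = batchnorm(u @ W)
out2 = batchnorm(u @ (a * W))
print(np.allclose(out1, out2))   # True (up to the small eps term)
```

Since scaling the weights leaves the BN output unchanged, the gradient flowing back through the layer does not blow up or vanish with the parameter scale, which is what permits the larger learning rates mentioned above.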

Keeps the network out of the saturated modes of its nonlinearities.

Because of γ and β, the mean and variance of the layer inputs are not rigidly fixed.

The transform is differentiable, reduces ICS, can represent the identity transformation, and preserves the network's capacity.

Also applicable to CNNs: BN is applied so that the convolutional property is preserved (statistics are shared across all spatial locations of a feature map, with one γ, β pair per feature map), as sketched below.
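A minimal sketch of the convolutional case under those assumptions (NCHW layout, NumPy only; the function name is mine):

```python
import numpy as np

def bn_conv_train(x, gamma, beta, eps=1e-5):
    """BN for conv layers: statistics are computed jointly over the batch AND all
    spatial locations, one (gamma, beta) pair per feature map (channel).
    x has shape (N, C, H, W); gamma and beta have shape (C,)."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)        # one mean per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)        # one variance per channel
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

rng = np.random.default_rng(0)
feature_maps = rng.normal(loc=2.0, scale=3.0, size=(8, 3, 32, 32))   # N=8, C=3, H=W=32
y = bn_conv_train(feature_maps, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=(0, 2, 3)), y.var(axis=(0, 2, 3)))  # ~0 means, ~1 variances per channel
```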