
Layer Normalization Review

gmlee729 2024. 12. 4. 21:11

Problem

How to reduce training time: normalize the summed inputs to the neurons, as Batch Normalization (BN) does

BN's disadvantages: it depends on the mini-batch size, and it is not obvious how to apply it to RNNs

RNNs require different statistics for different time-steps, since test sequences can be longer than any seen during training

 

Layer Normalization

Definition: normalize the summed inputs to the neurons within a layer, using a mean and variance computed over all the hidden units of that layer for a single training case

BN: normalizes the summed inputs to each hidden unit over the training cases in a mini-batch

LN "transposes" BN: the normalization statistics are computed over the neurons in a layer on a single training case

all the hidden units in a layer share the same normalization terms µ and σ, but different training cases have different normalization terms

same computation at training and test times

Advantages

works well for RNNs

can be used in the online regime with a batch size of 1 (see the sketch below)
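
A minimal NumPy sketch of the idea (the function name, shapes, and example values are my own, not from the paper): the statistics come from the hidden units of one training case, so no batch dimension or running average is needed.

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    """Layer-normalize the summed inputs `a` of one layer for a single case.

    a    : (H,) summed inputs to the H hidden units of the layer
    gain : (H,) learned scale (g in the paper)
    bias : (H,) learned shift (b in the paper)
    """
    mu = a.mean()                                   # mean over hidden units, not over a batch
    sigma = np.sqrt(((a - mu) ** 2).mean() + eps)   # std over hidden units
    return gain * (a - mu) / sigma + bias

# Batch size 1 (online regime); the same computation is used at training and test time.
H = 4
a = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(a, np.ones(H), np.zeros(H)))
```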

 

Related work

the best performance of recurrent BN is obtained by keeping independent normalization statistics for each time-step

initializing the gain parameter in the recurrent BN layer to 0.1 makes a significant difference in the final performance

LN is also related to weight normalization: instead of the variance, the L2 norm of the incoming weights is used to normalize the summed inputs to a neuron (see the sketch below)
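
A rough sketch of that contrast (my own toy code, not the weight-normalization authors' implementation): the summed input to one neuron is divided by the L2 norm of that neuron's incoming weight vector rather than by a data-dependent standard deviation.

```python
import numpy as np

def weight_normalized_summed_input(w, x, gain, eps=1e-8):
    """Summed input to one neuron, normalized by the L2 norm of its incoming weights.

    w : (D,) incoming weight vector of the neuron
    x : (D,) layer input
    """
    a = w @ x                                    # ordinary summed input
    return gain * a / (np.linalg.norm(w) + eps)  # divide by ||w||_2 instead of a std estimate
```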

 

BN vs LN

BN uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance, which are then used to normalize the summed input to that neuron on each training case.

The normalization standardizes each summed input using its mean and its standard deviation across the training data

it therefore requires running averages of the summed input statistics to be kept for use at test time

BN normalizes the summed inputs to each hidden unit over the training cases in the mini-batch.
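
For contrast, a toy NumPy view of where BN gets its statistics (names and shapes are mine): the mean and variance are taken over the mini-batch axis, per hidden unit, which is why BN depends on the batch size and needs running averages at test time.

```python
import numpy as np

def batch_norm_stats(A):
    """Per-unit BN statistics for a mini-batch.

    A : (N, H) summed inputs for N training cases and H hidden units.
    Returns mu, sigma of shape (H,), estimated across the batch axis.
    """
    mu = A.mean(axis=0)       # one mean per hidden unit, over the N cases
    sigma = A.std(axis=0)     # one std per hidden unit, over the N cases
    return mu, sigma          # replaced by running averages at test time
```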

 

LN instead computes the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case

the proposed method directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer

the "covariate shift" problem can be reduced by fixing the mean and the variance of the summed inputs within each layer

The layer normalization statistics are thus computed over all the hidden units in the same layer, as follows:
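
Written out, the paper's statistics for layer l with H hidden units (where a_i^l is the summed input to the i-th hidden unit in that layer) are:

```latex
\mu^{l} = \frac{1}{H}\sum_{i=1}^{H} a_i^{l}, \qquad
\sigma^{l} = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_i^{l} - \mu^{l}\right)^{2}}
```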