Layer Normalization Review
Problem
How to reduce training time: normalize the summed inputs - BN
BN's disadvantage: dependent on the mini-batch size, and not obvious how to apply to RNNs (see the sketch below)
RNNs require different statistics for different time-steps
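A minimal NumPy sketch (my own illustration, not the paper's code) of where the mini-batch dependence comes from: BN's mean and variance for each hidden unit are taken over the batch axis, so one example's normalized value changes with the rest of the batch. The names batch_norm and a, and the shapes, are assumptions for illustration.

```python
import numpy as np

def batch_norm(a, eps=1e-5):
    """Normalize summed inputs a of shape (batch_size, num_hidden), BN-style."""
    mu = a.mean(axis=0, keepdims=True)   # per-unit mean over the mini-batch
    var = a.var(axis=0, keepdims=True)   # per-unit variance over the mini-batch
    return (a - mu) / np.sqrt(var + eps)

a = np.random.randn(32, 128)             # hypothetical pre-activations of one layer
a_hat = batch_norm(a)                     # result depends on the other 31 examples
```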
Layer Normalization
Definition: normalize the summed inputs to the neurons of a layer using a mean and variance computed over all the hidden units in that layer, on a single training case
BN normalizes the summed inputs to each hidden unit over the training cases
LN "transposes" BN: the statistics are computed over the hidden units in a layer, on a single training case (see the sketch below)
all the hidden units in a layer share the same normalization terms µ and σ
the same computation is performed at training and test times
Advantages
works well for RNNs
can be used in the online regime with batch size 1
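A minimal sketch of layer normalization as defined above, in NumPy (assumed shapes; g and b stand for the learnable gain and bias). The statistics are computed over the hidden units of a single case, so batch size 1 works and the training and test computations are identical.

```python
import numpy as np

def layer_norm(a, g, b, eps=1e-5):
    """a: (batch_size, num_hidden); g, b: (num_hidden,) learnable gain and bias."""
    mu = a.mean(axis=-1, keepdims=True)    # one mean per training case
    sigma = a.std(axis=-1, keepdims=True)  # one std per training case
    return g * (a - mu) / (sigma + eps) + b

H = 128
g, b = np.ones(H), np.zeros(H)
a_single = np.random.randn(1, H)           # online regime: a batch of size 1
h = layer_norm(a_single, g, b)             # same computation at train and test time
```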
Related work
the best performance of BN in RNNs is obtained by keeping independent normalization statistics for each time step
initializing the gain parameter of the recurrent BN layer to 0.1 makes a significant difference in the final performance
the work is also related to weight normalization: instead of the variance, the L2 norm of the incoming weights is used to normalize the summed inputs to a neuron
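A rough sketch of the weight-normalization idea for comparison (assumed formulation following Salimans & Kingma, not code from the reviewed paper): each neuron's summed input is rescaled by the L2 norm of its incoming weight vector rather than by an activation statistic. The names weight_norm_linear, v, and g are hypothetical.

```python
import numpy as np

def weight_norm_linear(x, v, g):
    """x: (batch, in_dim); v: (in_dim, out_dim) weight directions; g: (out_dim,) scales."""
    norms = np.linalg.norm(v, axis=0)  # L2 norm of each neuron's incoming weights
    w = g * v / norms                  # reparameterization w = g * v / ||v||
    return x @ w                       # summed inputs scaled by ||v||, not by a variance

x = np.random.randn(4, 64)
v = np.random.randn(64, 128)
y = weight_norm_linear(x, v, np.ones(128))
```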
BN vs LN
BN
uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and a variance, which are then used to normalize the summed input to that neuron on each training case
the normalization standardizes each summed input using its mean and standard deviation across the training data
requires running averages of the summed input statistics at test time
in other words, BN normalizes the summed inputs to each hidden unit over the training cases
LN
computes the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case
directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer, fixing the mean and the variance of the summed inputs within each layer
the layer normalization statistics are therefore computed over all the hidden units in the same layer
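A side-by-side sketch of the two normalization axes compared in this section (NumPy, illustrative only): BN reduces over the batch dimension, which is why it needs running averages for test time, while LN reduces over the hidden dimension of each individual case.

```python
import numpy as np

a = np.random.randn(32, 128)                   # (batch, hidden) summed inputs

# BN statistics: one (mu, var) per hidden unit, shared across the mini-batch
bn_mu, bn_var = a.mean(axis=0), a.var(axis=0)  # shapes (128,) and (128,)

# LN statistics: one (mu, var) per training case, shared across the hidden units
ln_mu, ln_var = a.mean(axis=1), a.var(axis=1)  # shapes (32,) and (32,)
```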