Rectifier Nonlinearities
There are many possible choices of activation function for a neural network (NN). Much work has shown that using the rectified linear unit (ReLU) helps improve discriminative performance.
The figure below shows a few popular activation functions, including sigmoid and tanh.
sigmoid: g(x) = 1 / (1 + exp(-x)). The derivative of the sigmoid is g'(x) = (1 - g(x)) g(x).
tanh: g(x) = sinh(x)/cosh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
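As a minimal sketch (assuming NumPy; the helper names like sigmoid_grad are just for illustration), these two activations and the sigmoid derivative identity look like this:

```python
import numpy as np

def sigmoid(x):
    # g(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # g'(x) = (1 - g(x)) g(x)
    g = sigmoid(x)
    return (1.0 - g) * g

def tanh(x):
    # g(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)); same as np.tanh
    return np.tanh(x)

# sanity check: the closed-form derivative matches a finite difference
x = np.linspace(-5, 5, 11)
eps = 1e-5
fd = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
assert np.allclose(fd, sigmoid_grad(x), atol=1e-6)
```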
The rectifier (hard ReLU) is simply a max function:
g(x) = max(0, x)
Another version is the noisy ReLU, max(0, x + N(0, σ(x))). ReLU can be approximated by the so-called softplus function (whose derivative is the logistic function):
g(x) = log(1+exp(x))
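Here is a rough sketch of these rectifier variants (again assuming NumPy; in noisy_relu the noise scale sigma is passed in by the caller, which is an assumption about how σ(x) is chosen). The last two lines check numerically that the derivative of softplus is the logistic function:

```python
import numpy as np

def relu(x):
    # hard ReLU: g(x) = max(0, x)
    return np.maximum(0.0, x)

def noisy_relu(x, sigma):
    # noisy ReLU: max(0, x + N(0, sigma)); sigma supplied by the caller here
    return np.maximum(0.0, x + np.random.normal(0.0, sigma, size=np.shape(x)))

def softplus(x):
    # smooth approximation to ReLU: g(x) = log(1 + exp(x)),
    # written with logaddexp for numerical stability at large |x|
    return np.logaddexp(0.0, x)

# derivative of softplus ~ logistic (sigmoid) function
x = np.linspace(-5, 5, 11)
eps = 1e-5
fd = (softplus(x + eps) - softplus(x - eps)) / (2 * eps)
assert np.allclose(fd, 1.0 / (1.0 + np.exp(-x)), atol=1e-6)
```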
The derivative of hard ReLU is piecewise constant: for x > 0, g'(x) = 1, and for x < 0, g'(x) = 0 (at x = 0 the derivative is undefined, and in practice either value is used).
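A small check of that piecewise gradient (a sketch; the relu_grad helper is hypothetical and picks 0 at x = 0 by convention):

```python
import numpy as np

def relu_grad(x):
    # g'(x) = 1 for x > 0, 0 for x < 0 (value at exactly 0 is a convention)
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.5, 3.0])
eps = 1e-5
fd = (np.maximum(0.0, x + eps) - np.maximum(0.0, x - eps)) / (2 * eps)
assert np.allclose(fd, relu_grad(x))
```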
This recent ICML paper discusses possible reasons why ReLU sometimes outperforms the sigmoid function:
- Hard ReLU naturally enforces sparsity, since any unit with a negative pre-activation outputs exactly zero.
- The derivative of ReLU is constant (1 for all positive inputs), whereas the derivative of the sigmoid vanishes as x moves away from zero in either direction (see the sketch after this list).
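To see both points concretely, here is a toy comparison (a sketch, assuming NumPy and standard-normal pre-activations): the fraction of exactly-zero ReLU outputs illustrates the sparsity, and the gradient values show how the sigmoid derivative shrinks for large |x| while the ReLU derivative stays at 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)          # simulated pre-activations

# sparsity: hard ReLU zeroes out roughly half of these inputs
relu_out = np.maximum(0.0, x)
print("fraction of exact zeros:", np.mean(relu_out == 0.0))   # ~0.5

# gradient behaviour far from the origin
far = np.array([-10.0, -5.0, 5.0, 10.0])
g = 1.0 / (1.0 + np.exp(-far))
print("sigmoid gradient:", (1.0 - g) * g)            # vanishes as |x| grows
print("ReLU gradient:   ", (far > 0).astype(float))  # exactly 0 or 1
```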