## Rectifier Nonlinearities

There are multiple choices of activation function for a neural network. Much work has shown that using the rectified linear unit (ReLU) helps improve discriminative performance.

The figure below shows a few popular activation functions, including sigmoid and tanh.

**sigmoid**: g(x) = 1 / (1 + exp(-x)). The derivative of the sigmoid function is g'(x) = g(x)(1 - g(x)).

**tanh**: g(x) = sinh(x)/cosh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

**Rectifier** (hard ReLU) is simply a max function:

g(x)=max(0,x)

Another version is the noisy ReLU, max(0, x + *N*(0, σ(x))). ReLU can be approximated by the so-called *softplus* function (whose derivative is the logistic function):

g(x) = log(1+exp(x))

The derivative of the hard ReLU is constant over each of the two ranges x < 0 and x > 0: for x > 0, g' = 1, and for x < 0, g' = 0 (at x = 0 it is undefined).

A recent ICML paper discusses possible reasons why ReLU sometimes outperforms the sigmoid function:

- Hard ReLU naturally enforces sparsity.
- The derivative of ReLU is constant, whereas the derivative of the sigmoid saturates (dies out) as x grows large in either direction.
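As a quick sanity check, these activations are easy to evaluate in a few lines of numpy (a minimal sketch; the function names are mine, not from any library):

```python
import numpy as np

def sigmoid(x):
    # g(x) = 1 / (1 + exp(-x)); derivative is g(x) * (1 - g(x))
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # hard ReLU: g(x) = max(0, x); g' = 1 for x > 0, g' = 0 for x < 0
    return np.maximum(0.0, x)

def softplus(x):
    # smooth approximation to the hard ReLU; its derivative is the sigmoid
    return np.log1p(np.exp(x))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))      # [0. 0. 2.]
print(sigmoid(x))
print(np.tanh(x))
```

Note how the numerical derivative of softplus at any point matches the sigmoid there, which is the "derivative is the logistic function" claim above.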

## Exercising Sparse Autoencoder

Deep learning has recently become a hot topic in both academia and industry. I guess the best way to learn something is to implement it. So I checked the recent tutorial posted at

ACL 2012 + NAACL 2013 Tutorial: Deep Learning for NLP (without Magic)

and it has a nice 'assignment' for whoever wants to learn about sparse autoencoders. So I got my hands on it, and the final code is here.

There are two main parts to an autoencoder: feedforward and backpropagation. The essential quantity to calculate is the "error term", because it determines the partial derivatives for the parameters, including both the weights W and the bias term b.

You can think of an autoencoder as an unsupervised learning algorithm that sets the target values equal to the inputs. But why bother to reconstruct the signal? The trick is in the hidden layer, where a small number of nodes is used (smaller than the dimension of the input data; this is the sparsity enforced on the hidden layer). That is why an autoencoder has this 'vase' shape.

Thus, the network is forced to learn a compressed representation of the input. You can think of it as learning a concise, intrinsic structure of the data, analogous to a PCA representation, where the data can be described by a few axes. To enforce such sparsity, the average activation value (averaged across all training samples) for each node in the hidden layer is forced to equal a small value close to zero (the sparsity parameter). For every node, a KL divergence between the target sparsity and the average activation observed on the training data is computed and added to both the cost function and the derivatives, which in turn update the parameters (W and b).
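A minimal numpy sketch of that sparsity penalty (the variable names and the 100 × 25 activation matrix are made up for illustration; this is not the assignment's code):

```python
import numpy as np

def kl_sparsity(rho, rho_hat):
    # KL divergence between the target sparsity rho and the
    # observed mean activation rho_hat (elementwise)
    return (rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

# hypothetical hidden-layer activations: n_samples x n_hidden
rng = np.random.RandomState(0)
a = rng.rand(100, 25)

rho_hat = a.mean(axis=0)   # average activation of each hidden node
rho = 0.05                 # sparsity parameter, close to zero

# term added to the cost function
penalty = kl_sparsity(rho, rho_hat).sum()

# term added to each hidden node's error during backpropagation
grad = -rho / rho_hat + (1 - rho) / (1 - rho_hat)
```

The penalty is zero exactly when every node's average activation hits the target, and grows as nodes become more active than the target allows.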

After learning completes, the weights represent the signals (think of certain abstractions or atoms) learned from the data without supervision, like below:

## The problem of SVM for Imbalanced Data

I accidentally made a mistake in my code that led to training an SVM classifier with almost 10 times as many negative samples as positive ones. The performance dropped by almost 30%, which is quite significant: the overwhelming number of negative samples biased the classification boundary. Here's a nice ECML 2004 paper that studies this problem. The authors summarize three causes of the performance loss with imbalanced data.

*1. Positive points lie further from the ideal boundary.*

An intuitive way to think about this is the example provided by the authors: draw n numbers between 1 and 100 from a uniform distribution (because it is uniform, the chance of drawing 100 on any single draw is 1/100). The chance of drawing a number close to 100 improves as n increases (roughly n/100).
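That intuition is easy to check with a tiny Monte Carlo simulation (a sketch of my own, not from the paper; the function name and the "close to 100" cutoff of 95 are arbitrary choices):

```python
import random

def max_of_n_draws(n, trials=10000):
    """Estimate P(max of n uniform draws from 1..100 is >= 95).

    The exact value is 1 - (94/100)**n, which climbs toward 1
    as n increases -- the majority class reaches the boundary.
    """
    hits = sum(
        max(random.randint(1, 100) for _ in range(n)) >= 95
        for _ in range(trials)
    )
    return hits / trials
```

With n = 1 the estimate sits near 0.06; with n = 20 it is above 0.7, so the class with more samples is far more likely to have points near the ideal boundary.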

*2. Weakness of soft margins.*

The penalty term (weighted by C) minimizes the associated errors. If C is not very large, the SVM simply classifies everything as negative, because the error contributed by the few positive examples is so small.

*3. Imbalanced Support Vector Ratio.*

With imbalanced training data, the ratio between positive and negative support vectors also becomes more imbalanced. This increases the chance that the neighborhood of a test example is dominated by negative support vectors, so a point near the boundary is more likely to be classified as negative.
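One common remedy (my own sketch using scikit-learn's per-class weighting, not the specific method proposed in the ECML paper; the toy Gaussian data is invented for illustration) is to scale C inversely to the class frequencies:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# imbalanced toy data: 200 negatives around (-1, -1), 20 positives around (1, 1)
X = np.vstack([rng.randn(200, 2) - 1, rng.randn(20, 2) + 1])
y = np.array([0] * 200 + [1] * 20)

plain = SVC(kernel='linear').fit(X, y)
# class_weight='balanced' raises the penalty C on the rare positive class,
# pushing the boundary back toward the negatives
balanced = SVC(kernel='linear', class_weight='balanced').fit(X, y)
```

On data like this, the weighted model recovers more of the positive class than the unweighted one, at the cost of a few extra false positives.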

## Machine Learning by MapReduce

Holidays are always a good time to slow down and reflect. So this time, I'm reading somewhat randomly, not to solve any problem at hand, but just to enjoy the fruits of other researchers' work. Here are some readings on MapReduce and machine learning.

Many machine learning algorithms fit Kearns' Statistical Query Model:

Linear regression, k-means, Naive Bayes, SVM, EM, PCA, backprop. These can all be written (exactly) as summations over the data, which leads to a linear speedup in the number of processors.
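To make the summation form concrete, here is a toy map/reduce sketch for linear regression (the chunking and function names are mine for illustration; a real job would run the map step on separate machines):

```python
import numpy as np

def map_stats(chunk_X, chunk_y):
    # "map": each worker computes its local sufficient statistics
    return chunk_X.T @ chunk_X, chunk_X.T @ chunk_y

def reduce_stats(stats):
    # "reduce": sum the per-chunk statistics; summation commutes,
    # so the result is exact, not an approximation
    XtX = sum(s[0] for s in stats)
    Xty = sum(s[1] for s in stats)
    return XtX, Xty

rng = np.random.RandomState(0)
X = rng.randn(1000, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.randn(1000)

# split the data as if it were distributed over 4 workers
chunks = zip(np.array_split(X, 4), np.array_split(y, 4))
XtX, Xty = reduce_stats([map_stats(cx, cy) for cx, cy in chunks])
w = np.linalg.solve(XtX, Xty)  # identical to the single-machine fit
```

Because X'X and X'y decompose into per-example sums, adding workers splits the map step evenly, which is where the linear speedup comes from.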

Some papers on Mapreduce:

– Map-Reduce for Machine Learning on Multicore

– MapReduce: Distributed Computing for Machine Learning, 2006

– Large Language Models in Machine Translation

– Fast, Easy, and Cheap: Construction of Statistical Machine Translation Models with MapReduce

– Parallel Implementations of Word Alignment Tool

– Inducing Gazetteers for Named Entity Recognition by Large-scale Clustering of Dependency Relations

– Pairwise Document Similarity in Large Collections with MapReduce

– Aligning Needles in a Haystack: Paraphrase Acquisition Across the Web

– Google News Personalization: Scalable Online Collaborative Filtering (assign users to clusters, and weight stories based on the ratings of the users in that cluster)

## Random Forest in Python

milk is a machine learning package written in Python. It comes with a companion package called milksets, which includes several U.C.I. machine learning datasets.

from milksets import wine

features,labels = wine.load()

features will be a 2D numpy.ndarray of features (n_samples × n_features) and labels will be a 1D numpy.ndarray of labels ranging from 0 to N-1 (independently of how the labels were coded in the original data).

Below is an example using milk's random forest to predict labels for the wine data. There are three classes, and the feature matrix has shape (178, 13). Samples with marker 'o' are correct predictions; those with marker 'x' are incorrect. It takes some time to do the prediction; the cross-validation accuracy is 0.9438.
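For readers without milk installed, here is a sketch of the same experiment using scikit-learn instead (an alternative I'm substituting in, not the original milk code; sklearn's load_wine ships the same 178 × 13 U.C.I. wine data, and the parameter choices below are mine):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)   # (178, 13) features, 3 classes
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 10-fold cross-validated accuracy of the random forest
scores = cross_val_score(clf, X, y, cv=10)
print(scores.mean())
```

The cross-validated accuracy lands in the same mid-0.9 range as the milk run above, which is a good sign that either toolkit learns essentially the same thing from this small dataset.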

## The triple of Clustering

PAKDD workshop on “Multi-view data, High-dimensionality, External Knowledge: Striving for a Unified Approach to Clustering”

Figure 1 – Concept map of major research themes in advanced data clustering

## Machine Learning Blogs

Here's an OPML file for some good machine learning blogs. This Q&A post lists a lot of others.

Maybe I should continue collecting them 🙂