The Problem of SVM for Imbalanced Data

I accidentally made a mistake in my code that led to training an SVM classifier with almost 10 times as many negative samples as positive samples. Performance dropped by almost 30%, which is quite significant. The overwhelming number of negative samples biased the classification boundary. Here’s a nice paper from ECML 2004 that studies this problem. The author summarizes three causes of performance loss with imbalanced data.

1. Positive points lie further from the ideal boundary.

An intuitive way to think about this is the example provided by the author: draw n numbers at random from a uniform distribution between 1 and 100 (because it’s uniform, the chance of drawing 100 on any single draw is 1/100). The chance of drawing a number close to 100 improves as n increases (roughly n/100 for small n). By the same logic, the few positive samples are less likely than the many negative samples to include points that lie close to the ideal boundary.
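
The draw example can be checked numerically. Here is a minimal sketch, not taken from the paper: the function name `prob_hit_top` and the parameters `top_k` and `upper` are my own, and it assumes integer draws from 1 to 100. It computes the exact chance that at least one of n draws lands among the `top_k` largest values, which is 1 - (1 - top_k/100)^n.

```python
def prob_hit_top(n, top_k=1, upper=100):
    """Chance that at least one of n uniform integer draws from 1..upper
    falls among the top_k largest values."""
    miss_once = 1.0 - top_k / upper      # a single draw misses the top values
    return 1.0 - miss_once ** n          # at least one of the n draws hits


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(prob_hit_top(n), 4), "approx n/100 =", n / 100)
```

The n/100 figure is only a good approximation when n is small. At n = 100 the exact chance of seeing the value 100 at least once is about 0.634, not 1.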

2. Weakness of soft margins.

The penalty constant (C) sets how heavily training errors are punished relative to the width of the margin. If C is not very large, the SVM simply classifies everything as negative, because the total error on the few positive examples is so small.
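
To make the role of C concrete, here is a small sketch on made-up one-dimensional data (the data, the function names, and the two candidate solutions are my own assumptions, not from the paper). It evaluates the standard soft-margin objective, 0.5 * w^2 + C * (sum of hinge losses), for two hand-picked classifiers: one that labels everything negative and one that separates the training data perfectly.

```python
def soft_margin_objective(w, b, xs, ys, C):
    """0.5 * w^2 + C * sum of hinge losses, for one-dimensional inputs."""
    hinge = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in zip(xs, ys))
    return 0.5 * w * w + C * hinge


# Toy data (made up for illustration): 100 negatives at x = 0, 10 positives at x = 0.25.
xs = [0.0] * 100 + [0.25] * 10
ys = [-1] * 100 + [1] * 10


def compare(C):
    """Return the objective of the two candidates as (all_negative, separating)."""
    all_negative = soft_margin_objective(0.0, -1.0, xs, ys, C)  # predicts -1 everywhere
    separating = soft_margin_objective(8.0, -1.0, xs, ys, C)    # zero hinge loss
    return all_negative, separating


if __name__ == "__main__":
    print(compare(1.0))
    print(compare(10.0))
```

With C = 1 the all-negative candidate has the lower objective (20 versus 32), so giving up on the 10 positives is cheaper than paying for a narrow margin. With C = 10 the order flips (200 versus 32). These are only two candidates, not the true optimum, but they show why a small C favors the majority class.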

3. Imbalanced Support Vector Ratio.

With imbalanced training data, the ratio between positive and negative support vectors also becomes more imbalanced. This increases the chance that the neighborhood of a test example is dominated by negative support vectors, so a point near the boundary is more likely to be classified as negative.
