The Problem of SVMs with Imbalanced Data
I accidentally introduced a bug in my code that caused an SVM classifier to be trained with almost 10 times as many negative samples as positive samples. Performance dropped by almost 30%, which is quite significant: the overwhelming number of negative samples biased the classification boundary. Here's a nice paper from ECML 2004 that studies this problem. It summarizes three causes of performance loss with imbalanced data.
1. Positive points lie further from the ideal boundary.
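An intuitive way to think about this is the example given in the paper: draw n numbers uniformly at random between 1 and 100, so that each draw has a 1/100 chance of being 100. The chance of drawing at least one number close to 100 grows with n. For example, the chance of drawing 100 at least once is 1 - (99/100)^n, which is roughly n/100 when n is small. In the same way, with only a few positive samples it is less likely that any of them lies close to the ideal boundary, so the learned boundary is pushed toward the positive side.
2. Weakness of soft margins.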
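To make the draw example concrete, here is a minimal sketch that computes the chance of at least one of n uniform draws landing near 100. The function name and the "close to 100" window of 96 to 100 are my own assumptions for illustration, not something taken from the paper.

```python
def prob_at_least_one_hit(n, low=96, high=100, population=100):
    """Chance that at least one of n uniform draws from 1..population
    falls in [low, high]. The window 96..100 is an illustrative choice."""
    p_single = (high - low + 1) / population  # chance for a single draw
    return 1.0 - (1.0 - p_single) ** n        # complement of "all n draws miss"


# Compare a small sample (think: positives) with a sample 10 times larger
# (think: negatives).
for n in (10, 100):
    print(n, round(prob_at_least_one_hit(n), 3))
```

The larger sample is far more likely to contain a point near the edge, which mirrors how the majority class tends to have samples closer to the ideal boundary than the minority class.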
The penalty constant C controls how heavily training errors are weighed against the width of the margin. If C is not very large, the SVM simply classifies everything as negative, because the total error accumulated on the few positive examples is so small.
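For reference, this is the standard soft-margin SVM objective, written in the usual textbook notation (the symbols w, b, and the slack variables ξ_i are not from the post itself):

```latex
\min_{w,\, b,\, \xi} \quad \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i
\qquad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 .
```

When everything is labeled negative, only the few positive examples contribute to the slack sum, so with a small C that penalty can be cheaper than accepting a narrower margin.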
3. Imbalanced support vector ratio.
With imbalanced training data, the ratio between positive and negative support vectors also becomes more imbalanced. This increases the chance that the neighborhood of a test example is dominated by negative support vectors, so a point near the boundary is more likely to be classified as negative.