0x3C : data science fundamentals: The Kolmogorov-Smirnov test: an example

After introducing the basics of the Kolmogorov-Smirnov test, the plan is now to start from our beloved normal distribution with mean 0 and standard deviation 1, generate n samples of m variates each, and use the Kolmogorov-Smirnov test to count how often it rejects the null hypothesis that a sample was drawn from that normal distribution. Since every sample really does come from the null distribution, each rejection is a false positive. This is the code I wrote:

import numpy as np
from scipy.stats import norm
from scipy.stats import kstest

mean, sigma, mvariates = 0, 1, 100
np.random.seed(2)

nsamples, rejections = 1000, 0

for i in range(nsamples):
    # draw one sample of mvariates normal variates and test it
    # against the normal distribution with the same parameters
    sample = norm.rvs(mean, sigma, size=mvariates)
    k, p = kstest(sample, 'norm', args=(mean, sigma))
    if p < 0.05:
        rejections = rejections + 1

print('The null hypothesis is rejected', rejections, '/', nsamples,
      'times at the 5% significance level.')

For large samples I expected the test to reject the null hypothesis about 5% of the time with a p < 0.05 threshold, and this is indeed what happens. But I was surprised to see that even for the smallest samples, containing only one variate, the test rejects the null hypothesis approximately 5% of the time. It turns out that the sample size enters the p-value computation in the kstest() function, so p-values are computed correctly regardless of the sample size.
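To check that the rejection rate does not depend on the sample size, here is a minimal sketch that repeats the experiment for samples of 1, 10 and 100 variates. The helper rejection_rate() and the choice of sample sizes are my own assumptions for illustration, and the sketch uses NumPy's newer default_rng() generator instead of np.random.seed(), so the counts will not match the run above exactly.

```python
import numpy as np
from scipy.stats import kstest


def rejection_rate(m, nsamples=1000, alpha=0.05, seed=2):
    """Fraction of standard-normal samples of size m rejected by the KS test.

    Hypothetical helper for illustration; not part of SciPy.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(nsamples):
        # every sample is drawn from the null distribution N(0, 1)
        sample = rng.normal(0.0, 1.0, size=m)
        _, p = kstest(sample, 'norm')
        if p < alpha:
            rejections += 1
    return rejections / nsamples


# compare very small, small and moderately large samples
rates = {m: rejection_rate(m) for m in (1, 10, 100)}
for m, rate in rates.items():
    print(f"m = {m:3d}: rejection rate = {rate:.3f}")
```

The one-variate case can also be worked out by hand. With a single variate x, the test statistic is D = max(F(x), 1 - F(x)), and F(x) is uniformly distributed under the null hypothesis, so P(D >= d) = 2(1 - d) for d between 0.5 and 1. The p-value drops below 0.05 exactly when D exceeds 0.975, which happens with probability 0.05.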

Addendum: the code above can be improved

2012-07-25

