Consider the model Y = a + b X + error.
A key component of the standard error of the estimate of b, and hence of its confidence interval, is N * V(X). The tradeoff between N and the variance of X is exact: only their product enters. We can use this to examine the effect of (a) splitting X at its median or (b) using only the upper and lower thirds of the distribution.
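To make the role of N * V(X) explicit, here is the textbook standard error of the slope in simple regression. The symbols \hat{b}, \sigma (the residual standard deviation) and \bar{X} are notation added for this illustration, and V(X) is taken as the variance computed with divisor N.

```latex
\mathrm{SE}(\hat{b})
  = \frac{\sigma}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}}
  = \frac{\sigma}{\sqrt{N \, V(X)}}
```

Anything that shrinks the product N V(X) by a factor k therefore inflates the standard error by 1/\sqrt{k}, other things being equal.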
Note that no matter what the distribution of X, the usual regression provides an unbiased least-squares estimate of the coefficient b. In particular, if we split the observations into subgroups on the basis of X, compute the mean of X within each subgroup, and use those subgroup means as the predictor values in a regression of Y, we still get an unbiased estimate of b, but with a different standard error. Comparing the standard error for the continuous X with that for the split X shows the effect of splitting.
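A small simulation can illustrate this claim. The sketch below is only an illustration under assumed values (a = 1, b = 2, unit error variance, N = 20000, a fixed seed); the function names are hypothetical and not from any package. It fits the slope once on the continuous X and once on X replaced by its within-half mean.

```python
import random
import statistics


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def median_split_means(xs):
    """Replace each x by the mean of x within its half of the sample."""
    cut = statistics.median(xs)
    m_lo = statistics.fmean(x for x in xs if x < cut)
    m_hi = statistics.fmean(x for x in xs if x >= cut)
    return [m_lo if x < cut else m_hi for x in xs]


def simulate(n=20000, a=1.0, b=2.0, sigma=1.0, seed=0):
    """Return (slope on continuous X, slope on split X, variance ratio)."""
    rng = random.Random(seed)  # assumed seed, for repeatability only
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [a + b * x + rng.gauss(0.0, sigma) for x in xs]
    xs_split = median_split_means(xs)
    ratio = statistics.pvariance(xs_split) / statistics.pvariance(xs)
    return slope(xs, ys), slope(xs_split, ys), ratio


if __name__ == "__main__":
    b_cont, b_split, ratio = simulate()
    print(b_cont, b_split, ratio)
```

Both slopes should land near the assumed b = 2, while the variance of the split predictor should be a noticeably smaller fraction of the variance of the continuous X.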
(a) Median split. Let’s assume a standard normal distribution for illustration. Splitting at the median is then also splitting at the mean. The mean value of X in the lower half of the distribution is -Sqrt[2/Pi] = -.80 and the mean in the upper half is Sqrt[2/Pi] = .80. The variance of the split predictor is therefore 2/Pi = .637. All the components of the standard error of b are the same for the continuous and the split model, except that
N V(X) = N (for the standard normal)
in the continuous model is replaced by
.637 N V(X) = .637 N (for the standard normal)
This is the same factor by which r^2 is reduced (r^2 is multiplied by 2/Pi), and this value has appeared in numerous articles criticizing the splitting of data.
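The median-split numbers can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name is hypothetical, and the last line assumes, as above, that everything except N V(X) is held fixed.

```python
import math
from statistics import NormalDist


def half_normal_mean():
    """Mean of a standard normal X given X > 0, i.e. Sqrt[2/Pi]."""
    std = NormalDist()  # mean 0, standard deviation 1
    # E[X | X > c] = pdf(c) / (1 - cdf(c)); here c = 0, the median.
    return std.pdf(0.0) / (1.0 - std.cdf(0.0))


upper_mean = half_normal_mean()
# Two equal-sized groups sitting at -m and +m have variance m^2.
split_variance = upper_mean ** 2
# Implied inflation of the standard error of b, other things equal.
se_inflation = 1.0 / math.sqrt(split_variance)

print(upper_mean, split_variance, se_inflation)
```

The variance comes out at 2/Pi, and the implied standard error is roughly 25% larger than with the continuous X.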
(b) Using only the upper 1/3 and lower 1/3 of cases. For a standard normal distribution the mean of the lower 1/3 of the values of X is -1.09 and the mean of the upper 1/3 is +1.09.
So the variance of the retained predictor values is 1.09^2 = 1.19. But we’ve also lost 1/3 of our cases.
Hence the term
N V(X) = N
is replaced in the thirds model by
(2/3) N 1.19 V(X) = .79 N
Thus, in terms of the standard error and the confidence interval width, using only the top 1/3 and bottom 1/3 of the data is not as destructive as median splits, but it is still a bad idea.
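The same calculation can be sketched for any tail fraction, which makes the comparison between thirds and halves concrete. The function names below are hypothetical, and the standard normal is assumed as in the text.

```python
from statistics import NormalDist


def tail_mean(p):
    """Mean of a standard normal X within its upper tail of probability p."""
    std = NormalDist()
    cut = std.inv_cdf(1.0 - p)  # lower boundary of the upper tail
    return std.pdf(cut) / p     # E[X | X > cut]


def retained_factor(p):
    """Factor multiplying N * V(X) when only the top and bottom fraction p are kept."""
    m = tail_mean(p)
    # A fraction 2p of the cases is kept; they sit at -m and +m, so variance is m^2.
    return 2.0 * p * m ** 2


thirds = retained_factor(1.0 / 3.0)  # about .79
halves = retained_factor(0.5)        # the median split, 2/Pi

print(tail_mean(1.0 / 3.0), thirds, halves)
```

With p = 1/2 no cases are dropped and the function reproduces the median-split factor, so the two designs can be compared on the same scale.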
Furthermore, for modest sizes of N, the loss of 1/3 of the degrees of freedom might substantially increase the value of the critical t. Combined with the larger standard error, this means the thirds model will have substantially less statistical power.
The message, repeated in numerous methodological articles and well known to Pearson in 1900, is that (a) throwing away information about your variable is never a good idea and (b) throwing away observations in the middle of the distribution is never a good idea.